Abstract

When can two regular word languages K and L be separated by a simple
language? We investigate this question and consider separation by piecewise-
and suffix-testable languages and variants thereof. We give characterizations
of when two languages can be separated and present an overview of when
these problems can be decided in polynomial time if K and L are given by
nondeterministic automata.

1 Introduction

In this paper we are motivated by scenarios in which we want to describe something
complex by means of a simple language. The technical core of our scenarios consists
of separation problems, which are usually of the following form:

Given are two languages K and L. Does there exist a language S, coming
from a family $\mathcal{F}$ of simple languages, such that S contains everything
from K and nothing from L?

The family $\mathcal{F}$ of simple languages could be, for example, languages definable in FO,
piecewise testable languages, or languages definable with small automata.

Our work is specifically motivated by two seemingly orthogonal problems coming
from practice: (a) increasing the user-friendliness of XML Schema and (b) efficient
approximate query answering. We explain these next.

Our first motivation comes from simplifying XML Schema. XML Schema is 
currently the only industrially accepted and widely supported schema language for 
XML. Historically, it is designed to alleviate the limited expressiveness of Document 
Type Definition (DTD) [7], thereby making DTDs obsolete. Unfortunately, XML 
Schema's extra expressiveness comes at the cost of simplicity. Its code is designed 
to be machine-readable rather than human-readable and its logical core, based on 
complex types, does not seem well-understood by users [18]. One reason may be that
the specification of XML Schema's core [9] consists of over 100 pages of intricate 
text. The BonXai schema language [IHllin] is an attempt to overcome these issues 
and to combine the simplicity of DTDs with the expressiveness of XML Schema. It 
has exactly the same expressive power as XML Schema, is designed to be human- 
readable, and avoids the use of complex types. Therefore, it aims at simplifying 
the development or analysis of XSDs. In its core, a BonXai schema is a set of rules 
$L_1 \to R_1, \ldots, L_n \to R_n$ in which all $L_i$ and $R_i$ are regular expressions. An unranked
tree t (basically, an XML document) is in the language of the schema if, for every
node u, the word formed by the labels of u's children is in the language $R_k$, where k
is the largest number such that the word of ancestors of u is in $L_k$. This semantical
definition is designed to ensure full back-and-forth compatibility with XML Schema.

When translating an XML Schema Definition (XSD) into an equivalent BonXai 
schema, the regular expressions Li are obtained from a finite automaton that is 
embedded in the XSD. Since the current state-of-the-art in translating automata to 
expressions does not yet generate sufficiently clean results for our purposes, we are
investigating simpler classes of expressions which we expect to suffice in practice.
Practical and theoretical studies show evidence that regular expressions of the form
$\Sigma^* w$ (with $w \in \Sigma^+$) and $\Sigma^* a_1 \Sigma^* \cdots \Sigma^* a_n$ (with $a_1, \ldots, a_n \in \Sigma$) and variations
thereof seem to be quite well-suited [lOl [HI [20] . We study these kinds of expressions 
in this paper. 

Our second motivation comes from efficient approximate query answering. Ef- 
ficiently evaluating regular expressions is relevant in a very wide array of fields. 
We choose one: in graph databases and in the context of the SPARQL language 
O [m [25 for querying RDF data. Typically, regular expressions are used in this 
context to match paths between nodes in a huge graph. In fact, the data can be so 
huge that exact evaluation of a regular expression r over the graph (which can lead 
to a product construction between an automaton for the expression and the graph 
[16, 22]) may not be feasible within reasonable time. Therefore, as a compromise
to exact evaluation, one could imagine that we try to rewrite the regular expression
r as an expression that we can evaluate much more efficiently and that is close enough
to r. Concretely, we could specify two expressions $r_{pos}$ (resp., $r_{neg}$) that define the
language we want to (resp., do not want to) match in our answer and ask whether
there exists a simple query (e.g., defining a piecewise testable language) that satisfies
these constraints. Notice that the scenario of approximating an expression r in
this way is very general and not even limited to databases. (Also, we can take $r_{neg}$
to be the complement of $r_{pos}$.)

At first sight, these two motivating scenarios may seem to be fundamentally 
different. In the first, we want to compute an exact simple description of a complex 
object and in the second one we want to compute an approximate simple query that 
can be evaluated more efficiently. However, both scenarios boil down to the same 
underlying question of language separation. Our contributions are: 
(1) We formally define separation problems that closely correspond to the motivating
scenarios. Query approximation will be abstracted as separation and schema
simplification as layer-separation (Section 2.1).

(2) We give a general characterization of separability of languages K and L in terms
of boolean combinations of simple languages, layer-separability, and the existence
of an infinite sequence of words that goes back and forth between K and L. This
characterization shows how the exact and approximate scenarios are related and
does not require K and L to be regular (Section 3). Our characterization generalizes
a result by Stern [26] that says that a regular language L is piecewise testable iff
every increasing infinite sequence of words (w.r.t. subsequence ordering) alternates
finitely many times between L and its complement.

(3) In Section 4 we prove a decomposition characterization for separability of regular
languages by piecewise testable languages and we give an algorithm that decides
separability. The decomposition characterization is in the spirit of an algebraic
result by Almeida. It is possible to prove our characterization using Almeida's
result but we provide a self-contained, elementary proof which can be understood
without a background in algebra. We then use this characterization to distill a
polynomial time decision procedure for separability of languages of NFAs (or regular
expressions) by piecewise testable languages. The state-of-the-art algorithm for
separability by piecewise testable languages ([3, 5]) runs in time $O(\mathrm{poly}(|Q|) \cdot 2^{|\Sigma|})$
when given DFAs for the regular languages, where $|Q|$ is the number of states in the
DFAs and $|\Sigma|$ is the alphabet size. Our algorithm runs in time $O(\mathrm{poly}(|Q| + |\Sigma|))$
even for NFAs. We explain the connection to [3, 5] more closely in the Appendix.
Notice that $|\Sigma|$ can be large (several hundreds and more) in the scenarios that
motivate us, so we believe the improvement with respect to the alphabet to be
relevant in practice.

(4) Whereas Section 4 focuses exclusively on separation by piecewise testable languages,
we broaden our scope in Section 5. Let us say that a subsequence language
is a language of the form $\Sigma^* a_1 \Sigma^* \cdots \Sigma^* a_n \Sigma^*$ (with all $a_i \in \Sigma$). Similarly, a suffix
language is of the form $\Sigma^* a_1 \cdots a_n$. We present an overview of the complexities
of deciding whether regular languages can be separated by subsequence languages,
suffix languages, finite unions thereof, or boolean combinations thereof. We prove
all cases to be in polynomial time, except separability by a single subsequence
language, which is NP-complete. By combining this with the results from Section 3 we
also have that layer-separability is in polynomial time for all languages we consider.

We now discuss further related work. There is a large body of related work that 
has not been mentioned yet. Piecewise testable languages are defined and studied 
by Simon [231 121] , who showed that a regular language is piecewise testable iff its 
syntactic monoid is J-trivial and iff both the minimal DFA for the language and the 
minimal DFA for the reversal are partially ordered. Stern [27] suggested an $O(n^5)$
algorithm in the size of a DFA to decide whether a regular language is piecewise 
testable. This was improved to quadratic time by Trahtman |25j. (Actually, from 
our proof, it now follows that this question can be decided in polynomial time if an 
NFA and its complement NFA are given.) 

Almeida [3] established a connection between a number of separation problems 
and properties of families of monoids called pseudovarieties. Almeida shows, e.g., 
that deciding whether two given regular languages can be separated by a language 
with its syntactic monoid lying in pseudovariety V is algorithmically equivalent to 
computing two-pointlike sets for a monoid in pseudovariety V. It is then shown by 
Almeida et al. [3| how to compute these two-pointlike sets in the pseudovariety J 
corresponding to piecewise testable languages. Henckell et al. [12] and Steinberg [25] 
show that the two-pointlike sets can be computed for pseudovarieties corresponding 
to languages definable in first order logic and languages of dot depth at most one, 
respectively. By Almeida's result [3] this implies that the separation problem is also 
decidable for these classes. 
2 Preliminaries and Definitions

For a finite set S, we denote its cardinality by $|S|$. By $\Sigma$ we always denote an
alphabet, that is, a finite set of symbols. A ($\Sigma$-)word w is a finite sequence of
symbols $a_1 \cdots a_n$, where $n \ge 0$ and $a_i \in \Sigma$ for all $i = 1, \ldots, n$. The length of
w, denoted by $|w|$, is n and the alphabet of w, denoted by Alph(w), is the set
$\{a_1, \ldots, a_n\}$ of symbols occurring in w. The empty word is denoted by $\varepsilon$. The set
of all $\Sigma$-words is denoted by $\Sigma^*$. A language is a set of words. For $v = a_1 \cdots a_n$ and
$w \in \Sigma^* a_1 \Sigma^* \cdots \Sigma^* a_n \Sigma^*$, we say that v is a subsequence of w, denoted by $v \preceq w$.
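
The subsequence relation is used throughout the paper, so a small executable reading of the definition may help. The following is a minimal sketch, assuming words are represented as Python strings; the function names `is_subsequence` and `alph` are ours and do not come from the paper.

```python
def is_subsequence(v, w):
    """Return True if v is a (scattered) subsequence of w, i.e. v ⪯ w."""
    it = iter(w)
    # `symbol in it` advances the iterator until it finds `symbol`,
    # so the symbols of v must occur in w in the same left-to-right order.
    return all(symbol in it for symbol in v)


def alph(w):
    """Alph(w): the set of symbols occurring in the word w."""
    return set(w)
```

For instance, `is_subsequence("ac", "abc")` holds while `is_subsequence("ca", "abc")` does not, and the empty word is a subsequence of every word.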

A (nondeterministic) finite automaton or NFA A is a tuple $(Q, \Sigma, \delta, q_0, F)$, where
Q is a finite set of states, $\delta : Q \times \Sigma \to 2^Q$ is the transition function, $q_0 \in Q$ is the
initial state, and $F \subseteq Q$ is the set of accepting states. We sometimes denote that
$q_2 \in \delta(q_1, a)$ as $q_1 \xrightarrow{a} q_2 \in \delta$ to emphasize that A being in state $q_1$ can go to state
$q_2$ reading an $a \in \Sigma$. A run of A on word $w = a_1 \cdots a_n$ is a sequence of states
$q_0 \cdots q_n$ where, for each $i = 1, \ldots, n$, we have $q_{i-1} \xrightarrow{a_i} q_i \in \delta$. The run is accepting
if $q_n \in F$. Word w is accepted by A if there is an accepting run of A on w. The
language of A, denoted by L(A), is the set of all words accepted by A. By $\delta^*$ we
denote the extension of $\delta$ to words, that is, $\delta^*(q, w)$ is the set of states that can be
reached from q by reading w. The size $|A| = |Q| + \sum_{q,a} |\delta(q, a)|$ of A is the total
number of transitions and states. An NFA is deterministic (a DFA) when every
$\delta(q, a)$ consists of at most one element.
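
The definitions of this paragraph can be sketched in code as follows. This is only an illustration under our own assumptions: the class name `NFA`, its field names, and the encoding of $\delta$ as a dictionary from (state, symbol) pairs to sets of states are hypothetical choices, not notation from the paper.

```python
from dataclasses import dataclass


@dataclass
class NFA:
    states: set
    alphabet: set
    delta: dict      # maps (state, symbol) -> set of successor states
    initial: object
    accepting: set

    def delta_star(self, q, w):
        """Set of states reachable from q by reading the word w."""
        current = {q}
        for a in w:
            current = {r for p in current for r in self.delta.get((p, a), set())}
        return current

    def accepts(self, w):
        """w is accepted iff some run on w ends in an accepting state."""
        return bool(self.delta_star(self.initial, w) & self.accepting)

    def size(self):
        """|A| = number of states plus number of transitions."""
        return len(self.states) + sum(len(t) for t in self.delta.values())

    def is_deterministic(self):
        return all(len(t) <= 1 for t in self.delta.values())


# Illustrative automaton for the suffix language Σ* a b over Σ = {a, b}.
suffix_ab = NFA(
    states={0, 1, 2},
    alphabet={"a", "b"},
    delta={(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}},
    initial=0,
    accepting={2},
)
```

Here `suffix_ab.accepts("aab")` holds, `suffix_ab.accepts("aba")` does not, and the automaton has size 3 + 4 = 7.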

The regular expressions (RE) over $\Sigma$ are defined as follows: $\varepsilon$ and every $\Sigma$-symbol
is a regular expression; whenever r and s are regular expressions, then so
are $(r \cdot s)$, $(r + s)$, and $(s)^*$. In addition, we allow $\emptyset$ as a regular expression, but we
assume that $\emptyset$ does not occur in any other regular expression. For readability, we
usually omit concatenation operators and parentheses in examples. We sometimes
abbreviate an n-fold concatenation of r by $r^n$. The language defined by an RE r is
denoted by L(r) and is defined as usual. Often we simply write r instead of L(r).
Whenever we say that expressions or automata are equivalent, we mean that they
define the same language. The size $|r|$ of r is the total number of occurrences of
alphabet symbols, epsilons, and operators in r, i.e., the number of nodes in its parse
tree. A regular expression is union-free if it does not contain the operator +. A
language is union-free if it is defined by a union-free regular expression.

A quasi-order is a reflexive and transitive relation. For a quasi-order $\preceq$, the
(upward) $\preceq$-closure of a language L is the set $closure^{\preceq}(L) = \{w \mid v \preceq w \text{ for some } v \in L\}$.
We denote the $\preceq$-closure of a word w as $closure^{\preceq}(w)$ instead of $closure^{\preceq}(\{w\})$.
Language L is (upward) $\preceq$-closed if $L = closure^{\preceq}(L)$.

A quasi-order $\preceq$ on a set X is a well-quasi-ordering (a WQO) if for every infinite
sequence $(x_i)_{i=1}^{\infty}$ of elements of X there exist indices $i < j$ such that $x_i \preceq x_j$.
It is known that every WQO is also well-founded, that is, there exist no infinite
descending sequences $x_1 \succeq x_2 \succeq \cdots$ such that $x_i \not\preceq x_{i+1}$ for all i.

Higman's Lemma [13] (which we use multiple times) states that, for every alphabet
$\Sigma$, the subsequence relation $\preceq$ is a WQO on $\Sigma^*$. Notice that, as a corollary
to Higman's Lemma, every $\preceq$-closed language is a finite union of languages of the
form $\Sigma^* a_1 \Sigma^* \cdots \Sigma^* a_n \Sigma^*$, which means that it is also regular, see also [5]. A language
is piecewise testable if it is a finite boolean combination of $\preceq$-closed languages
(or, finite boolean combination of languages $\Sigma^* a_1 \Sigma^* \cdots \Sigma^* a_n \Sigma^*$). In this paper, all
boolean combinations are finite.
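
To make the corollary of Higman's Lemma concrete, the sketch below represents a $\preceq$-closed language by a finite set of generating words and tests membership in it and in a boolean combination. It is an illustration only: the helper names and the example language ("contains ab but not ba as a subsequence") are our own assumptions.

```python
def is_subsequence(v, w):
    it = iter(w)
    return all(symbol in it for symbol in v)


def in_upward_closure(w, generators):
    """Is w in the union of the languages Σ* a1 Σ* ... an Σ* for a1...an in generators?"""
    return any(is_subsequence(v, w) for v in generators)


def minimal_words(words):
    """Keep only the ⪯-minimal words of a finite set; they generate the same closure."""
    words = set(words)
    return {v for v in words
            if not any(u != v and is_subsequence(u, v) for u in words)}


def in_example_language(w):
    """A hypothetical piecewise testable language: closure(ab) minus closure(ba)."""
    return in_upward_closure(w, {"ab"}) and not in_upward_closure(w, {"ba"})
```

For example, `minimal_words({"ab", "aab", "abb", "ba"})` yields `{"ab", "ba"}`, and `in_example_language` accepts `"aabb"` but rejects `"aba"`.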
Figure 1: An example of a layer-separation.
2.1 Separability of Languages 

A language S separates language K from L if S contains K and does not intersect L.
We say that S separates K and L if it either separates K from L or L from K. Let
$\mathcal{F}$ be a family of languages. Languages K and L are separable by $\mathcal{F}$ if there exists a
language S in $\mathcal{F}$ that separates K and L. Languages K and L are layer-separable
by $\mathcal{F}$ if there exists a finite sequence of languages $S_1, \ldots, S_m$ in $\mathcal{F}$ such that

1. for all $1 \le i \le m$, language $S_i \setminus \bigcup_{j=1}^{i-1} S_j$ intersects at most one of K and L;

2. K or L (possibly both) is included in $\bigcup_{j=1}^{m} S_j$.

Notice that separability always implies layer-separability. However, the opposite 
implication does not hold, as we demonstrate next. 

Example 1. Let $\mathcal{F} = \{a^n a^* \mid n \ge 0\}$ be a family of $\preceq$-closed languages over
$\Sigma = \{a\}$, $K = \{a, a^3\}$, and $L = \{a^2, a^4\}$. We first show that languages K and
L are not separable by $\mathcal{F}$. Indeed, assume that $S \in \mathcal{F}$ separates K and L. If K
is included in S, then $aa^* \subseteq S$, hence L and S are not disjoint. Conversely, if
$L \subseteq S$, then $a^2 a^* \subseteq S$ and therefore S and K are not disjoint. This contradicts
that S separates K and L. Now we show that the languages are layer-separable by
$\mathcal{F}$. Consider languages $S_1 = a^4 a^*$, $S_2 = a^3 a^*$, $S_3 = a^2 a^*$, and $S_4 = a a^*$. Then both
K and L are included in $S_4$, and $S_1$ intersects only L, $S_2 \setminus S_1 = a^3$ intersects only
K, $S_3 \setminus (S_1 \cup S_2) = a^2$ intersects only L, and $S_4 \setminus (S_1 \cup S_2 \cup S_3) = a$ intersects only
K; see Fig. 1.
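
The two claims of Example 1 can be checked mechanically. The sketch below is specific to the unary family $\{a^n a^* \mid n \ge 0\}$: a word $a^m$ is encoded by its length m and a language $a^n a^*$ by its threshold n. The function names and this encoding are assumptions made for illustration.

```python
def first_layer(length, thresholds):
    """Index of the first layer S_i = a^{n_i} a^* that contains a^length, or None."""
    for i, n in enumerate(thresholds):
        if length >= n:
            return i
    return None


def is_layer_separation(K, L, thresholds):
    """Check both conditions of layer-separability for K, L given as sets of lengths."""
    hits = {}  # layer index -> which of K, L meet S_i minus the earlier layers
    for label, lang in (("K", K), ("L", L)):
        for length in lang:
            i = first_layer(length, thresholds)
            if i is not None:
                hits.setdefault(i, set()).add(label)
    condition1 = all(len(labels) <= 1 for labels in hits.values())

    def covered(lang):
        return all(first_layer(x, thresholds) is not None for x in lang)

    return condition1 and (covered(K) or covered(L))


def separable_by_single(K, L):
    """Is there one language a^n a^* that separates K and L (in either direction)?"""
    for n in range(max(K | L) + 2):
        k_in = [x >= n for x in K]
        l_in = [x >= n for x in L]
        if (all(k_in) and not any(l_in)) or (all(l_in) and not any(k_in)):
            return True
    return False


K, L = {1, 3}, {2, 4}
layers = [4, 3, 2, 1]   # S_1 = a^4 a^*, S_2 = a^3 a^*, S_3 = a^2 a^*, S_4 = a a^*
```

With these inputs, `separable_by_single(K, L)` is false while `is_layer_separation(K, L, layers)` is true, matching the example.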

Example 1 illustrates some intuition behind layered separability. Our motivation
for layered separability comes from the BonXai schema language which is discussed
in the introduction. We need to solve layer-separability if we want to decide whether
an XML Schema has an equivalent BonXai schema with simple regular expressions
(defining languages in $\mathcal{F}$). Layered separability implies that languages are, in a
sense, separable by languages from $\mathcal{F}$ in a priority-based system: If we consider
the ordered sequence of languages $S_1, S_2, S_3, S_4$ then, in order to classify a word
$w \in K \cup L$ in either K or L, we have to match it against the $S_i$ in increasing order
of the index i. If we know the lowest index j for which $w \in S_j$, we know whether
$w \in K$ or $w \in L$.

We now define a tool (similar to and slightly more general than the alternating
towers of Stern [26]) that allows us to determine when languages are not separable.
For languages K and L and a quasi-order $\preceq$, we say that a sequence $(w_i)_{i=1}^{k}$ of
words is a $\preceq$-zigzag between K and L if $w_1 \in K \cup L$ and, for all $i = 1, \ldots, k-1$:

(1) $w_i \preceq w_{i+1}$; (2) $w_i \in K$ implies $w_{i+1} \in L$; and (3) $w_i \in L$ implies $w_{i+1} \in K$.

We say that k is the length of the $\preceq$-zigzag. We similarly define an infinite sequence
of words to be an infinite $\preceq$-zigzag between K and L. If the languages K and L are
clear from the context then we sometimes omit them and refer to the sequence as a
(infinite) $\preceq$-zigzag. If we consider the subsequence order $\preceq$, then we simply write a
zigzag instead of a $\preceq$-zigzag. Notice that we do not require K and L to be disjoint.
If there is a $w \in K \cap L$ then there clearly exists an infinite zigzag: $w, w, w, \ldots$

Example 2. In order to illustrate infinite zigzags consider the languages
$K = \{a(ab)^{2k} c(ac)^{2\ell} \mid k, \ell \ge 0\}$ and $L = \{b(ab)^{2k+1} c(ac)^{2\ell+1} \mid k, \ell \ge 0\}$. Then the
following infinite sequence $(w_i)_{i=1}^{\infty}$ is an infinite zigzag between K and L:

$w_i = b(ab)^i c(ac)^i$ if i is odd, and $w_i = a(ab)^i c(ac)^i$ if i is even.

Indeed $w_1 \in L$, words from the sequence alternately belong to K and L, and for all
$i \ge 1$ we have $w_i \preceq w_{i+1}$. □
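
The zigzag conditions are easy to test on a finite prefix of such a sequence. The following sketch checks the definition for the subsequence order and applies it to the words of Example 2. The membership tests use Python regular expressions that we wrote to encode K and L; they, and all function names, are our own assumptions. A finite check of this kind illustrates the definition but of course proves nothing about the infinite sequence.

```python
import re


def is_subsequence(v, w):
    it = iter(w)
    return all(symbol in it for symbol in v)


def in_K(w):
    # K = { a (ab)^{2k} c (ac)^{2l} | k, l >= 0 }
    return re.fullmatch(r"a(abab)*c(acac)*", w) is not None


def in_L(w):
    # L = { b (ab)^{2k+1} c (ac)^{2l+1} | k, l >= 0 }
    return re.fullmatch(r"b(abab)*abc(acac)*ac", w) is not None


def is_zigzag(words, in_K, in_L):
    """Check conditions (1)-(3) of a zigzag for a finite list of words."""
    if not words or not (in_K(words[0]) or in_L(words[0])):
        return False
    for u, v in zip(words, words[1:]):
        if not is_subsequence(u, v):
            return False
        if in_K(u) and not in_L(v):
            return False
        if in_L(u) and not in_K(v):
            return False
    return True


def w(i):
    """The i-th word of the sequence in Example 2 (i >= 1)."""
    head = "b" if i % 2 == 1 else "a"
    return head + "ab" * i + "c" + "ac" * i
```

For instance, `is_zigzag([w(i) for i in range(1, 8)], in_K, in_L)` holds, whereas skipping a word, as in `[w(1), w(3)]`, breaks the alternation.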



3 A Characterization of Separability 

The aim of this section is to prove the following theorem. It extends a result by 
Stern that characterizes piecewise testable languages [26]. In particular, it applies 
to general languages and does not require K to be the complement of L. 

Theorem 3. For languages K and L and a WQO $\preceq$ on words, the following are
equivalent.

(1) K and L are separable by a boolean combination of $\preceq$-closed languages.

(2) K and L are layer-separable by $\preceq$-closed languages.

(3) There does not exist an infinite $\preceq$-zigzag between K and L.

Some of the equivalences in the theorem still hold when the assumptions are
weakened. For example, the equivalence between (1) and (2) does not require $\preceq$ to
be a WQO.

Since the subsequence order $\preceq$ is a WQO on words, we know from Theorem 3
that languages are separable by piecewise testable languages if and only if they are
layer-separable by $\preceq$-closed languages. Actually, since $\preceq$ is a WQO (and therefore
only has finitely many minimal elements within a language), the latter is equivalent
to being layer-separable by languages of the form $\Sigma^* a_1 \Sigma^* \cdots \Sigma^* a_n \Sigma^*$.

In Example 1 we illustrated two languages K and L that are layer-separable
by $\preceq$-closed languages. Notice that K and L can also be separated by a boolean
combination of the languages $a^* a^1$, $a^* a^2$, $a^* a^3$, and $a^* a^4$ from $\mathcal{F}$, as
$K \subseteq ((a^* a^1 \setminus a^* a^2) \cup (a^* a^3 \setminus a^* a^4))$ and $L \cap ((a^* a^1 \setminus a^* a^2) \cup (a^* a^3 \setminus a^* a^4)) = \emptyset$.

We now give an overview of the proof of Theorem 3. The next lemma proves
the equivalence between (1) and (2), but is slightly more general. In particular, it
does not rely on a WQO.

Lemma 4. Let $\mathcal{F}$ be a family of languages closed under intersection and containing
$\Sigma^*$. Then languages K and L are separable by a finite boolean combination of
languages from $\mathcal{F}$ if and only if K and L are layer-separable by $\mathcal{F}$.

The proof (given in the Appendix) is constructive. The only-if direction is
the more complex one and shows how to exploit the implicit negation in the first
condition in the definition of layer-separability in order to simulate separation by
boolean combinations. Notice that the families of $\preceq$-closed languages in Theorem 3
always contain $\Sigma^*$ and are closed under intersection.

The following lemma shows that the implication (2) $\Rightarrow$ (3) in Theorem 3 does
not require well-quasi ordering.

Lemma 5. Let $\preceq$ be a quasi-order on words and assume that languages K and L are
layer-separable by $\preceq$-closed languages. Then there is no infinite $\preceq$-zigzag between
K and L.

To prove that (3) implies (2), we need the following technical lemma in which
we require $\preceq$ to be a WQO. In the proof of the lemma, we argue how we can see
$\preceq$-zigzags in a tree structure. Intuitively, every path in the tree structure corresponds
to a $\preceq$-zigzag. We need the fact that $\preceq$ is a WQO in order to show that we
can assume that every node in this tree structure has a finite number of children.
We then apply König's lemma to show that arbitrarily long $\preceq$-zigzags imply the
existence of an infinite $\preceq$-zigzag. The lemma then follows by contraposition.

Lemma 6. Let $\preceq$ be a WQO on words. If there is no infinite $\preceq$-zigzag between
languages K and L, then there exists a constant $k \in \mathbb{N}$ such that no $\preceq$-zigzag
between K and L is longer than k.

If there is no infinite $\preceq$-zigzag, then we can put a bound on the maximal length
of zigzags by Lemma 6. This bound actually has a close correspondence to the
number of "layers" we need to separate K and L.

Lemma 7. Let $\preceq$ be a WQO on words and assume that there is no infinite $\preceq$-zigzag
between languages K and L. Then the languages K and L are layer-separable by
$\preceq$-closed languages.

4 Testing Separability by Piecewise Testable Languages

Whereas Section 3 proves a result for general WQOs, we focus in this section
exclusively on the ordering $\preceq$ of subsequences. Therefore, if we say zigzag in this section,
we always mean $\preceq$-zigzag. We show here how to decide the existence of an infinite
zigzag between two regular word languages, given by their regular expressions or
NFAs, in polynomial time. According to Theorem 3, this is equivalent to deciding
if the two languages can be separated by a piecewise testable language.

To this end, we first prove a decomposition result that is reminiscent of a result of 
Almeida ([2], Theorem 4.1 in ^). We show that, if there is an infinite zigzag between 
regular languages, then there is an infinite zigzag of a special form and in which 
every word can be decomposed in some synchronized manner. We can find these 
special forms of zigzags in polynomial time in the NFAs for the languages. The main
features are that our algorithm runs exponentially faster in the alphabet size than
the current state-of-the-art [5] and that our algorithm and its proof of correctness
do not require knowledge of the algebraic perspective on regular languages.

A regular language is a cycle language if it is of the form $u(v)^* w$, where u, v, w
are words and $(\mathrm{Alph}(u) \cup \mathrm{Alph}(w)) \subseteq \mathrm{Alph}(v)$. We say that v is the cycle of the
language and that Alph(v) is its cycle alphabet. Regular languages $L^A$ and $L^B$ are
synchronized in one step if they are of one of the following forms:

• $L^A = L^B = \{w\}$, that is, they are the same singleton word, or

• $L^A$ and $L^B$ are cycle languages with equal cycle alphabets.

We say that regular languages $L^A$ and $L^B$ are synchronized if they are of the form
$L^A = D_1^A D_2^A \cdots D_k^A$ and $L^B = D_1^B D_2^B \cdots D_k^B$ where, for all $1 \le i \le k$, languages
$D_i^A$ and $D_i^B$ are synchronized in one step. So, languages are synchronized if they
can be decomposed into (equally many) components that can be synchronized in
one step. Notice that synchronized languages are always non-empty.

Example 8. Languages $L^A = a(ba)^* aab\, ca\, bb(bc)^*$ and $L^B = b(aab)^* ba\, ca\, cc(cbc)^* b$
are synchronized. Indeed, $L^A = D_1^A D_2^A D_3^A$ and $L^B = D_1^B D_2^B D_3^B$ for $D_1^A = a(ba)^* aab$,
$D_2^A = ca$, $D_3^A = bb(bc)^*$ and $D_1^B = b(aab)^* ba$, $D_2^B = ca$, and $D_3^B = cc(cbc)^* b$.
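
The decomposition of Example 8 can be verified component by component. In the sketch below a component is either a Python string (a singleton word) or a triple `(u, v, w)` standing for the cycle language $u(v)^* w$. This encoding and the function names are assumptions made for illustration. The code checks a given decomposition only; it does not search for one.

```python
def alph(w):
    return set(w)


def is_cycle_language(u, v, w):
    """u(v)*w is a cycle language if Alph(u) ∪ Alph(w) ⊆ Alph(v)."""
    return (alph(u) | alph(w)) <= alph(v)


def synchronized_in_one_step(d_a, d_b):
    # Case 1: both components are the same singleton word.
    if isinstance(d_a, str) and isinstance(d_b, str):
        return d_a == d_b
    # Case 2: both are cycle languages with equal cycle alphabets.
    if isinstance(d_a, tuple) and isinstance(d_b, tuple):
        return (is_cycle_language(*d_a) and is_cycle_language(*d_b)
                and alph(d_a[1]) == alph(d_b[1]))
    return False


def synchronized(components_a, components_b):
    return (len(components_a) == len(components_b)
            and all(synchronized_in_one_step(x, y)
                    for x, y in zip(components_a, components_b)))


# Decompositions from Example 8.
L_A = [("a", "ba", "aab"), "ca", ("bb", "bc", "")]
L_B = [("b", "aab", "ba"), "ca", ("cc", "cbc", "b")]
```

Here `synchronized(L_A, L_B)` holds: the first components share the cycle alphabet {a, b}, the second are the same word, and the third share the cycle alphabet {b, c}.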

The next lemma shows that, in order to search for infinite zigzags, it suffices to
search for synchronized sublanguages. The proof goes through a sequence of lemmas
that gradually shows how the sublanguages of $L^A$ and $L^B$ can be made more and
more specific.

Lemma 9 (Synchronization / Decomposition). There is an infinite zigzag between
regular languages $L^A$ and $L^B$ if and only if there exist synchronized languages
$K^A \subseteq L^A$ and $K^B \subseteq L^B$.

We now use this result to obtain a polynomial-time algorithm solving our prob- 
lem. The first step is to define what it means for NFAs to contain synchronized 
sublanguages. 

For an NFA A over an alphabet $\Sigma$, two states p, q, and a word $w \in \Sigma^*$, we write
$p \xrightarrow{w} q$ if $q \in \delta^*(p, w)$ or, in other words, the automaton can go from state p to state
q by reading w. For $\Sigma_0 \subseteq \Sigma$, states p and q are $\Sigma_0$-connected in A if there exists a
word $uvw \in \Sigma_0^*$ such that:

1. $\mathrm{Alph}(v) = \Sigma_0$ and

2. there is a state m such that $p \xrightarrow{u} m$, $m \xrightarrow{v} m$, and $m \xrightarrow{w} q$.
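
For a fixed alphabet $\Sigma_0$, the two conditions above can be tested with plain graph reachability: a suitable cycle word v exists at m exactly when the strongly connected component of m, in the automaton restricted to $\Sigma_0$, has internal transitions carrying every symbol of $\Sigma_0$. The sketch below implements this test. The encoding of transitions as `(source, symbol, target)` triples and the function names are our assumptions, and the code deliberately does not address how a suitable $\Sigma_0$ is found without enumerating subsets of $\Sigma$.

```python
def reachable(start, edges):
    """States reachable from `start` along `edges`, a set of (src, symbol, dst)."""
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for (src, _, dst) in edges:
            if src == x and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def sigma0_connected(transitions, p, q, sigma0):
    """Are p and q Σ0-connected, for the given set sigma0 of symbols?"""
    sigma0 = set(sigma0)
    # Only transitions labelled by symbols of Σ0 may be used by u, v and w.
    edges = {(s, a, d) for (s, a, d) in transitions if a in sigma0}
    reverse = {(d, a, s) for (s, a, d) in edges}
    candidates = reachable(p, edges) & reachable(q, reverse)
    for m in candidates:
        # Strongly connected component of m in the restricted automaton.
        scc = reachable(m, edges) & reachable(m, reverse)
        labels = {a for (s, a, d) in edges if s in scc and d in scc}
        if labels == sigma0:
            return True
    return False


# Illustrative automaton: 0 -a-> 1, loops a and b on 1, 1 -b-> 2, loop c on 2.
T = {(0, "a", 1), (1, "a", 1), (1, "b", 1), (1, "b", 2), (2, "c", 2)}
```

In this automaton, states 0 and 2 are {a, b}-connected (take m = 1), but they are neither {a}-connected nor {a, b, c}-connected.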

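Consider two NFAs $A = (Q^A, \Sigma, \delta^A, q_0^A, F^A)$ and $B = (Q^B, \Sigma, \delta^B, q_0^B, F^B)$. Let
$(q^A, q^B)$ and $(\bar{q}^A, \bar{q}^B)$ be in $Q^A \times Q^B$. We say that $(q^A, q^B)$ and $(\bar{q}^A, \bar{q}^B)$ are
synchronizable in one step if one of the following situations occurs:

• there exists a symbol a in $\Sigma$ such that $q^A \xrightarrow{a} \bar{q}^A$ and $q^B \xrightarrow{a} \bar{q}^B$,

• there exists an alphabet $\Sigma_0 \subseteq \Sigma$ such that $q^A$ and $\bar{q}^A$ are $\Sigma_0$-connected in A
and $q^B$ and $\bar{q}^B$ are $\Sigma_0$-connected in B.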
Figure 2: Synchronization of automata A and B.

We say that automata A and B are synchronizable if there exists a sequence of pairs
$(q_0^A, q_0^B), \ldots, (q_k^A, q_k^B) \in Q^A \times Q^B$ such that:

1. for all $0 \le i < k$, $(q_i^A, q_i^B)$ and $(q_{i+1}^A, q_{i+1}^B)$ are synchronizable in one step;

2. states $q_0^A$ and $q_0^B$ are initial states of A and B, respectively; and

3. states $q_k^A$ and $q_k^B$ are accepting states of A and B, respectively.

Notice that if the automata A and B are synchronizable, then the languages
L(A) and L(B) are not necessarily synchronized; only some of their sublanguages are
necessarily synchronized.

Lemma 10 (Synchronizability of automata). For two NFAs A and B, the following 
conditions are equivalent. 

1. Automata A and B are synchronizable. 

2. There exist synchronized languages $K^A \subseteq L(A)$ and $K^B \subseteq L(B)$.

The intuition behind Lemma 10 is depicted in Figure 2. The idea is that there
is a sequence of pairs of states that witnesses that A and B are synchronizable.
The pairs of paths that have the same style of lines depict parts of the automata
that are synchronizable in one step. In particular, the dotted path in A
has the same word as the dotted path in B. The other two paths contain at least
one loop.

The following theorem states that synchronizability in automata captures exactly
the existence of infinite zigzags between their languages. The theorem statement
uses Theorem 3 for the connection between infinite zigzags and separability.

Theorem 11. Let A and B be two NFAs. Then the languages L(A) and L(B) are
separable by a piecewise testable language if and only if the automata A and B are
not synchronizable.

We can now show how the algorithm from [5] can be improved to test in polynomial
time whether two given NFAs are synchronizable or not. Our algorithm
computes quadruples of states that are synchronizable in one step and links
such quadruples together so that they form a pair of paths as illustrated in Figure 2.

Theorem 12. Given two NFAs A and B, it is possible to test in polynomial time
whether L(A) and L(B) can be separated by a piecewise testable language.
$\mathcal{F}(O, C)$               single         unions    bc (boolean combinations)
$\preceq$ (subsequence)     NP-complete    PTIME     PTIME
$\preceq_s$ (suffix)        PTIME          PTIME     PTIME

Table 1: The complexity of deciding separability for regular languages K and L.



5 Asymmetric Separation and Suffix Order 

We present a bigger picture on efficient separations that are relevant to the scenarios
that motivate us. For example, we consider what happens when we restrict the
allowed boolean combinations of languages. Technically, this means that separation
is no longer symmetric. Orthogonally, we also consider the suffix order between
strings in which $v \preceq_s w$ if and only if v is a (not necessarily strict) suffix of w. An
important technical difference with the rest of the paper is that the suffix order is not
a WQO. Indeed, the suffix order has an infinite antichain, e.g., a, ab, abb, abbb, . . .
The results we present here for suffix order hold true for prefix order as well.
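
The antichain claim can be checked on a finite prefix of the sequence a, ab, abb, . . . The sketch below does so and contrasts the suffix order with the subsequence order, under which the same words form a chain. The function names are our own, and a finite check only illustrates the claim.

```python
from itertools import combinations


def is_suffix(v, w):
    """v ⪯_s w: v is a (not necessarily strict) suffix of w."""
    return w.endswith(v)


def is_subsequence(v, w):
    it = iter(w)
    return all(symbol in it for symbol in v)


def is_antichain(words, leq):
    """No two distinct words of the list are comparable under `leq`."""
    return not any(leq(u, v) or leq(v, u) for u, v in combinations(words, 2))


words = ["a" + "b" * i for i in range(6)]   # a, ab, abb, abbb, ...
```

Here `is_antichain(words, is_suffix)` holds, because each word starts with its only a and two words of different length therefore cannot be suffixes of one another, while `is_antichain(words, is_subsequence)` fails.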

Let $\mathcal{F}$ be a family of languages. Language K is separable from a language L by
$\mathcal{F}$ if there exists a language S in $\mathcal{F}$ that separates K from L, i.e., contains K and
does not intersect L. Thus, if $\mathcal{F}$ is closed under complement, then K is separable
from L implies L is separable from K. The separation problem by $\mathcal{F}$ asks, given an
NFA for K and an NFA for L, whether K is separable from L by $\mathcal{F}$.

We consider separation by families of languages $\mathcal{F}(O, C)$, where O ("order")
specifies the ordering relation and C ("combinations") specifies how we are allowed
to combine (upward) O-closed languages. Concretely, O is either the subsequence
order $\preceq$ or the suffix order $\preceq_s$. We allow C to be one of single, unions, or bc (boolean
combinations), meaning that each language in $\mathcal{F}(O, C)$ is either the O-closure of
a single word, a finite union of the O-closures of single words, or a finite boolean
combination of the O-closures of single words. Thus, $\mathcal{F}(\preceq, bc)$ is the family of
piecewise testable languages and $\mathcal{F}(\preceq_s, bc)$ is the family of suffix-testable languages.
With this convention in mind, the main result of this section is to provide a complete
complexity overview of the six possible cases of separation by $\mathcal{F}(O, C)$. The case
$\mathcal{F}(\preceq, bc)$ has been proved in Section 4 and the remaining ones are proved in the
Appendix.

Theorem 13. For $O \in \{\preceq, \preceq_s\}$ and C being one of single, unions, or boolean
combinations, we have that the complexity of the separation problem by $\mathcal{F}(O, C)$ is
as indicated in Table 1.

Since the separation problem for prefix order is basically the same as the separation
problem for suffix order and has the same complexity, we did not list it separately
in the table. Furthermore, from Lemma 4 we immediately obtain that deciding
layer-separability for all six cases in Table 1 is in PTIME.

6 Conclusions and Further Questions 

Subsequence- and suffix languages seem to be very promising for obtaining "simple"
separations of regular languages, since we can often efficiently decide if two given
regular languages are separable (Table 1). Layer-separability is even in PTIME in
all cases. Looking back at our motivating scenarios, the obvious next questions are:
if a separation exists, can we efficiently compute one? How large is it?

If we look at the broader picture, we wonder if our characterization of separability
can be used in a wider context than regular languages and subsequence
ordering. Are there other cases where it can be used to obtain efficient decision
procedures? Another concrete question is whether we can decide in polynomial
time if a given NFA defines a piecewise-testable language. Furthermore, we are also
interested in efficient separation results by combinations of languages of the form
$\Sigma^* w_1 \Sigma^* \cdots \Sigma^* w_n$ or variants thereof.

Acknowledgments. We thank Jean-Eric Pin and Marc Zeitoun for patiently an- 
swering our questions about the algebraic perspective on this problem. We are 
Appendix

Connection to the Algorithm of Almeida and Zeitoun

Almeida and Zeitoun [5] show that the following problem is in polynomial time:

Input: Two DFAs $A = (Q^A, \Sigma, \delta^A, q_0^A, F^A)$ and $B = (Q^B, \Sigma, \delta^B, q_0^B, F^B)$ with
constant-size alphabet $\Sigma$.

Problem: Are L(A) and L(B) separable by a piecewise testable language?

A result by Almeida [3] says that separability is equivalent to computing the
intersection of topological closures of the regular languages that are to be separated. This
is used by Almeida and Zeitoun [5], who prove that these topological closures can
be represented by a class of automata (going beyond DFAs or NFAs) computable
from the original automata. The main differences with the present procedure are
that the construction of [5] is

(1) exponential in the size of the alphabet and

(2) defined on DFAs rather than on NFAs.

Actually, the exponential time bound w.r.t. the alphabet size has already been
observed by Almeida and Zeitoun in the conclusions of their paper [5]. The reason
why the algorithm from [5] is exponential in the size of the alphabet is that its first
step consists of adding, to each loop that uses a subset B of the alphabet $\Sigma$, a new
loop containing $B^\omega$. (See Definition 4.1 from [5]; the notation $B^\omega$ is borrowed from
that paper.) The number of these subsets can be exponential. In fact, the algorithm
from [5] first adds these cycles to A and B separately and then (after some more
operations) compares the automata to each other. However, the relevant cycles to
add to A depend on B.

Example 14. Consider a language 

LiA)^ial---a*J\ 

Then, for every subset $S = \{i_1, \dots, i_j\} \subseteq \{1, \dots, n\}$, there exists a language

L{Bs) = {ai^ ■ --a^.)* 

such that the intersection of the closures of the above languages contains only 
for words $u$ such that $\mathrm{Alph}(u) = \{a_{i_1}, \dots, a_{i_j}\}$.

The example shows that if we compute the closure of $L(A)$ without looking
simultaneously at $B$ we have to keep all of the exponentially many loops in order to
be prepared for intersecting this closure with the closure of any possible language
$L(B_S)$. In fact, one needs to do more than naively compute largest common subsets
of alphabets of loops that obviously correspond to each other. We show how to do
this while avoiding the exponent in 

The following is a slightly less trivial example that shows how alphabets of 
strongly connected components can correspond to each other. 

Example 15. Consider languages

$$(a^* b e^* d e^*)^* \, acb^* ac (ba)^* ca (bc)^* b$$

and

$$(ab)^* d (ab^* d f^*)^* b \big(c (ab)^* c^* b^* a (cb)^*\big)^* b.$$

These languages cannot be separated by a piecewise testable language. An example
of a profinite word in the intersection of the closures is

$$(abd)^{\omega} cac (ab)^{\omega} ca (bc)^{\omega}.$$

(Again, the notation $m^{\omega}$ is borrowed from [5] and is the standard one for the unique
idempotent power of element $m$ of the semigroup.)

Proofs of Section 3

Lemma 4. Let $\mathcal{F}$ be a family of languages closed under intersection and containing
$\Sigma^*$. Then languages $K$ and $L$ are separable by a finite boolean combination of
languages from $\mathcal{F}$ if and only if $K$ and $L$ are layer-separable by $\mathcal{F}$.

Proof. For a sequence $S_1, S_2, \dots, S_k$ denote, for all $i = 1, \dots, k$,

$$D_i^S = S_i \setminus \bigcup_{j=1}^{i-1} S_j .$$

To show the if part, assume that $S_1, S_2, \dots, S_m$ is the sequence of languages
from $\mathcal{F}$ layer-separating $K$ and $L$. We will construct a finite boolean combination
of languages from $\mathcal{F}$ that separates $K$ and $L$. By definition of layer separability, each
language $D_i^S$ intersects at most one of $K$ and $L$. Furthermore, $K$ or $L$ is included
in $\bigcup_{i=1}^{m} D_i^S = \bigcup_{j=1}^{m} S_j$. Without loss of generality, assume that $K$ is included in
$\bigcup_{j=1}^{m} D_j^S$, and set $J = \{j \mid D_j^S \cap K \neq \emptyset\}$. Then the language $S = \bigcup_{j \in J} D_j^S$ separates
$K$ and $L$. As $S$ is a finite boolean combination of languages from $\mathcal{F}$, this part is
shown.

To show the only if part, assume that $K$ and $L$ can be separated by a language
$S$ that is a finite boolean combination of languages from $\mathcal{U} = \{U_1, \dots, U_k\}$, a finite
subset of $\mathcal{F}$. Without loss of generality, assume that $K \subseteq S$ and $L \cap S = \emptyset$. For
any subset of indices $I \subseteq \{1, \dots, k\}$ we denote

$$\mathrm{cell}_{\mathcal{U}}(I) = \Big(\bigcap_{i \in I} U_i\Big) \cap \Big(\bigcap_{i \notin I} \overline{U_i}\Big),$$

where $\overline{U_i} = \Sigma^* \setminus U_i$, and call this language a cell; see Fig. 3 for an illustration.
Observe that the cells are pairwise disjoint and $S$ is a finite union of cells. As $S$
separates $K$ and $L$, every cell intersects at most one of the languages $K$ and $L$. The
cells that form $S$ do not intersect $L$ and the others do not intersect $K$. Based on
this, we construct a layer-separation of $K$ and $L$ by $\mathcal{F}$.

To this end, we show that there exists a sequence of languages $S_1, \dots, S_{2^k}$ from
$\mathcal{F}$ and a bijection $\pi : \{1, \dots, 2^k\} \to \mathcal{P}(\{1, \dots, k\})$ such that, for every $1 \le j \le 2^k$,

$$D_j^S = \mathrm{cell}_{\mathcal{U}}(\pi(j)).$$

We call $S_1, \dots, S_{2^k}$ a sequence of cell-separating languages for $\mathcal{U}$. It is easy to
see that this sequence $S_1, \dots, S_{2^k}$ would layer-separate $K$ and $L$. Indeed, for each
$1 \le i \le 2^k$, the set $D_i^S$ is a cell. Thus it intersects at most one of $K$ and $L$, which
is the first requirement of a layer-separation. Moreover, the union
$\bigcup_{1 \le i \le 2^k} S_i = \bigcup_{1 \le i \le 2^k} D_i^S$ includes all the cells, so it equals $\Sigma^*$, thus clearly includes both $K$ and
$L$.

Therefore, it only remains to prove that there exists a sequence $S_1, \dots, S_{2^k}$ of
cell-separating languages for $\mathcal{U}$. Before we show it formally we present an illustrating
example in order to give an intuition how the required sequence is constructed. For
$\mathcal{U} = \{U_1, U_2, U_3\}$ the cell-separating sequence is as follows:

$$U_1 \cap U_2 \cap U_3,\; U_2 \cap U_3,\; U_1 \cap U_3,\; U_3,\; U_1 \cap U_2,\; U_2,\; U_1,\; \Sigma^*$$

We prove by induction on $k$ that there is a sequence of cell-separating
languages for every $k$-element set $\mathcal{U} \subseteq \mathcal{F}$. For the base step, i.e., $k = 1$,
we have that $\mathcal{U} = \{U_1\}$. We can simply take $S_1 = U_1$ and $S_2 = \Sigma^*$ and we are
done. Assume now that, for some $k$, the induction hypothesis is satisfied. We prove
it for $k + 1$. Consider an arbitrary subset $\mathcal{U}' = \{U_1, \dots, U_k, U_{k+1}\}$ of $\mathcal{F}$ and take
$\mathcal{U} = \{U_1, \dots, U_k\}$. Let $S_1, \dots, S_{2^k}$ be the sequence of cell-separating languages for
$\mathcal{U}$. We will show that the sequence

$$S_1 \cap U_{k+1}, \dots, S_{2^k} \cap U_{k+1}, S_1, \dots, S_{2^k}$$

is cell-separating for $\mathcal{U}'$. We name this sequence $T_1, \dots, T_{2^{k+1}}$, i.e.,

$$T_i = \begin{cases} S_i \cap U_{k+1}, & \text{if } i \le 2^k \\ S_{i-2^k}, & \text{if } i > 2^k. \end{cases}$$

It is sufficient to show that there exists a bijection $\sigma$ between $\{1, \dots, 2^{k+1}\}$ and
$\mathcal{P}(\{1, \dots, k+1\})$ such that for $1 \le i \le 2^{k+1}$

$$D_i^T = \mathrm{cell}_{\mathcal{U}'}(\sigma(i)).$$
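Figure 3: Cells for two languages $U_1$ and $U_2$.

Assume that $\pi$ is a bijection between $\{1, \dots, 2^k\}$ and $\mathcal{P}(\{1, \dots, k\})$ such that

$$D_i^S = \mathrm{cell}_{\mathcal{U}}(\pi(i)).$$

We will show that $\sigma$ defined as

$$\sigma(i) = \begin{cases} \pi(i) \cup \{k+1\}, & \text{if } i \le 2^k \\ \pi(i - 2^k), & \text{if } i > 2^k \end{cases}$$

fulfills the necessary condition. If $i \le 2^k$ then

$$\begin{aligned}
D_i^T &= T_i \setminus \bigcup_{j=1}^{i-1} T_j = (S_i \cap U_{k+1}) \setminus \bigcup_{j=1}^{i-1} (S_j \cap U_{k+1}) = \Big(S_i \setminus \bigcup_{j=1}^{i-1} S_j\Big) \cap U_{k+1} \\
&= D_i^S \cap U_{k+1} = \mathrm{cell}_{\mathcal{U}}(\pi(i)) \cap U_{k+1} = \mathrm{cell}_{\mathcal{U}'}(\sigma(i)).
\end{aligned}$$

On the other hand if $i > 2^k$ then

$$\begin{aligned}
D_i^T &= T_i \setminus \bigcup_{j=1}^{i-1} T_j = S_{i-2^k} \setminus \Big(\bigcup_{j=1}^{2^k} (S_j \cap U_{k+1}) \cup \bigcup_{j=1}^{i-2^k-1} S_j\Big) \\
&= S_{i-2^k} \setminus \Big(U_{k+1} \cup \bigcup_{j=1}^{i-2^k-1} S_j\Big) = D_{i-2^k}^S \setminus U_{k+1} \\
&= \mathrm{cell}_{\mathcal{U}}(\pi(i - 2^k)) \setminus U_{k+1} = \mathrm{cell}_{\mathcal{U}'}(\sigma(i)),
\end{aligned}$$

since $\bigcup_{j=1}^{2^k} S_j = \Sigma^*$, which completes the proof. □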
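
The inductive construction in the proof above can be tried out on a small scale. The following is a minimal sketch, assuming that a finite set `universe` stands in for $\Sigma^*$ and that each $U_i$ is given as a finite subset of it; the function names `cell_separating_sequence`, `layers` and `cells` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def cell_separating_sequence(languages, universe):
    """Build S_1, ..., S_{2^k} for U = languages, following the induction."""
    # Base case k = 1: S_1 = U_1 and S_2 = the whole universe.
    seq = [frozenset(languages[0]), frozenset(universe)]
    for u in languages[1:]:
        u = frozenset(u)
        # Inductive step: S_1 & U_{k+1}, ..., S_{2^k} & U_{k+1}, S_1, ..., S_{2^k}.
        seq = [s & u for s in seq] + seq
    return seq


def layers(seq):
    """Return the sets D_i = S_i minus the union of all earlier S_j."""
    seen, out = set(), []
    for s in seq:
        out.append(frozenset(s - seen))
        seen |= s
    return out


def cells(languages, universe):
    """Return all cells: one per subset I of the indices of the U_i."""
    k = len(languages)
    result = set()
    for size in range(k + 1):
        for index_set in combinations(range(k), size):
            cell = set(universe)
            for i, u in enumerate(languages):
                cell &= set(u) if i in index_set else set(universe) - set(u)
            result.add(frozenset(cell))
    return result


universe = set(range(8))
U = [{1, 3, 5, 7}, {2, 3, 6, 7}, {4, 5, 6, 7}]
sequence = cell_separating_sequence(U, universe)
assert set(layers(sequence)) == cells(U, universe)
```

In this toy instance every cell is a distinct singleton, so comparing the two collections as sets is a meaningful check; with empty cells several layers would coincide.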

Lemma 5. Let $\preceq$ be a quasi-order on words and assume that languages $K$ and
$L$ are layer-separable by $\preceq$-closed languages. Then there is no infinite $\preceq$-zigzag
between $K$ and $L$.

Proof. For the sake of contradiction, assume that there exists an infinite $\preceq$-zigzag
$(w_i)_{i=1}^{\infty}$ between $K$ and $L$. Let $I = \{w_1, w_2, \dots\}$ and consider the sequence of
languages $S_1, \dots, S_m$ layer-separating $K$ and $L$. Let $k$ in $\{1, \dots, m\}$ be the lowest
index for which $S_k \cap I \neq \emptyset$. (Notice that $k$ exists by definition of layer-separations.)
Since we chose $k$ to be minimal, for every $j \ge 1$ it holds that

$$w_j \notin \bigcup_{i=1}^{k-1} S_i .$$

Let $\ell \ge 1$ be such that $w_\ell \in S_k \cap I$. Without loss of generality, assume that $w_\ell \in K$.
(Otherwise, we switch $K$ and $L$.) Then, by the definition of zigzag, $w_{\ell+1} \in L$. As
$S_k$ is $\preceq$-closed and as $w_\ell \preceq w_{\ell+1}$, we have that also $w_{\ell+1} \in S_k$. Thus,

$$w_{\ell+1} \in L \cap \Big(S_k \setminus \bigcup_{i=1}^{k-1} S_i\Big) \quad\text{and}\quad w_\ell \in K \cap \Big(S_k \setminus \bigcup_{i=1}^{k-1} S_i\Big).$$

But then the set $S_k \setminus \bigcup_{i=1}^{k-1} S_i$ intersects both languages $K$ and $L$, which is a
contradiction with the assumption that $S_1, \dots, S_m$ layer-separates $K$ and $L$. □
In the next proof we use König's Lemma, which we recall next. A tree is finitely
branching if every node has finitely many children. Note that the number of children
need not be bounded: for every $n$ there can be a node that has at least $n$ children.

Lemma 16 (König [15]). A finitely branching tree containing arbitrarily long paths
contains an infinite path.

Lemma 6. Let $\preceq$ be a WQO on words. If there is no infinite $\preceq$-zigzag between
languages $K$ and $L$, then there exists a constant $k \in \mathbb{N}$ such that no $\preceq$-zigzag between
$K$ and $L$ is longer than $k$.

Proof. All $\preceq$-zigzags considered in this proof are between languages $K$ and $L$. We
show that the existence of arbitrarily long $\preceq$-zigzags implies the existence of an
infinite $\preceq$-zigzag. To this end, we restrict the general form of $\preceq$-zigzags to be able to
use König's Lemma. Note that any WQO allows equivalent elements. For a word
$w$, let $[w] = \{v \in \Sigma^* \mid v \preceq w \text{ and } w \preceq v\}$ denote the equivalence class containing
$w$. For languages $K$ and $L$, we arbitrarily pick two elements from the sets $[w] \cap K$
and $[w] \cap L$, if they exist, denoted by $[w]_K$ and $[w]_L$, respectively, and call them
canonical elements of the class $[w]$. We say that a $\preceq$-zigzag $(w_i)_{i=1}^{k}$ is canonical if
it consists only of canonical elements, that is, if $w_i \in K$ then $w_i = [w_i]_K$, and if
$w_i \in L$ then $w_i = [w_i]_L$. Observe that if there exists a $\preceq$-zigzag of length $k$ then
there also exists a canonical $\preceq$-zigzag of length $k$. Indeed, replacing all elements
of the $\preceq$-zigzag with their corresponding canonical elements results in a canonical
$\preceq$-zigzag. Thus, in what follows, we consider only canonical $\preceq$-zigzags. Note that
we reduced the quasi-order to an order. We say that a $\preceq$-zigzag $(w_i)_{i=1}^{k}$ is denser
than a $\preceq$-zigzag $(v_i)_{i=1}^{k}$ if

• $w_i \preceq v_i$, for all $1 \le i \le k$;

• $w_i \in K \Leftrightarrow v_i \in K$, for all $1 \le i \le k$, and also symmetrically for $L$; and

• there exists $1 \le j \le k$ such that $w_j \neq v_j$.

A $\preceq$-zigzag is densest if there is no denser $\preceq$-zigzag.

Note that if a $\preceq$-zigzag $(w_i)_{i=1}^{k}$ is densest then $(w_i)_{i=1}^{j}$ is also densest for any
$j \le k$. Indeed, if $(v_i)_{i=1}^{j}$ is denser than $(w_i)_{i=1}^{j}$ then $v_1, \dots, v_j, w_{j+1}, \dots, w_k$ is also
a valid $\preceq$-zigzag, which is denser than $(w_i)_{i=1}^{k}$. Furthermore, observe that if there
exists a $\preceq$-zigzag of length $k$ then there also exists a densest $\preceq$-zigzag of length
$k$ because the denser order is well founded, as a suborder of a $k$-componentwise
product of well founded orders. Thus, by the assumptions, there exist arbitrarily
long densest $\preceq$-zigzags. Their first element belongs either to $K$ or to $L$. Without
loss of generality, we may assume that there are arbitrarily long densest $\preceq$-zigzags
starting in $K$. Note that the first word in every densest $\preceq$-zigzag is a shortest
canonical element with respect to the order $\preceq$ among the canonical elements of
$K$. As the order $\preceq$ is a WQO there are only finitely many shortest canonical
elements, thus there exists a word $w \in K$ such that there are arbitrarily long
densest $\preceq$-zigzags starting from $w$. Consider a tree consisting of all these $\preceq$-zigzags
forming its paths. By definition, this tree has arbitrarily long paths. It is also finitely
branching; otherwise, if a node has infinitely many children labelled by different
words $v_1, v_2, \dots$, the WQO property implies that we can find a pair of indices $i < j$
such that $v_i \preceq v_j$. Then the $\preceq$-zigzag obtained by choosing the path going through
$v_j$ is not densest, as we can change $v_j$ into $v_i$ in this zigzag and obtain a denser
one. Thus, by Lemma 16, this tree contains an infinite path that forms an infinite
$\preceq$-zigzag. □

Lemma 7. Let $\preceq$ be a WQO on words and assume that there is no infinite
$\preceq$-zigzag between languages $K$ and $L$. Then the languages $K$ and $L$ are layer-separable
by $\preceq$-closed languages.

Proof. For two languages $X$ and $Y$, let

$$\mathrm{layer}(X, Y) = \{w \in X \mid \text{there does not exist } w' \text{ in } Y \text{ such that } w \preceq w'\}$$

denote the set of all words of $X$ that are not smaller or equal to a string of $Y$ in the
WQO. We first show the following claim:

Claim 17. There exists a $\preceq$-closed language $S_{(X,Y)}$ such that $S_{(X,Y)} \cap Y = \emptyset$ and
$S_{(X,Y)} \cap X = \mathrm{layer}(X, Y)$.

The proof of the claim is simple. Let $S_{(X,Y)} = \bigcup_{w \in \mathrm{layer}(X,Y)} \mathrm{closure}^{\preceq}(w)$. By
definition, $S_{(X,Y)}$ is $\preceq$-closed. For each $w$ in $\mathrm{layer}(X, Y)$, we have that
$\mathrm{closure}^{\preceq}(w) \cap Y = \emptyset$ by definition of $\mathrm{layer}(X, Y)$. Therefore, $S_{(X,Y)} \cap Y = \emptyset$. Moreover, we have
that $\mathrm{layer}(X, Y) = S_{(X,Y)} \cap X$ because $w \in \mathrm{layer}(X, Y)$ implies that
$(\mathrm{closure}^{\preceq}(w) \cap X) \subseteq \mathrm{layer}(X, Y)$. This concludes the proof of the claim.

We now proceed with the proof of Lemma 7. Let $B$ be a constant such that no
$\preceq$-zigzag between $K$ and $L$ is longer than $B$. This constant exists by Lemma 6
since there is no infinite $\preceq$-zigzag between $K$ and $L$. Define the languages $K_0 = K$,
$L_0 = L$, and, for each $i \in \mathbb{N}$,

$$K_{i+1} = K_i \setminus \mathrm{layer}(K_i, L_i) \qquad L_{i+1} = L_i \setminus \mathrm{layer}(L_i, K_i).$$

We prove by induction on $i$ that every $\preceq$-zigzag between $K_i$ and $L_i$ has length at
most $B - i$. The claim holds for $K_0$ and $L_0$, thus consider $K_{i+1}$ and $L_{i+1}$, for
$i \ge 0$. Since $K_{i+1} \subseteq K_i$ and $L_{i+1} \subseteq L_i$ we have that every $\preceq$-zigzag between
$K_{i+1}$ and $L_{i+1}$ would also be a $\preceq$-zigzag between $K_i$ and $L_i$. By induction we
know that every $\preceq$-zigzag between $K_i$ and $L_i$ has length at most $B - i$. Therefore,
every $\preceq$-zigzag between $K_{i+1}$ and $L_{i+1}$ also has length at most $B - i$. It remains
to prove that there cannot be a $\preceq$-zigzag of length $B - i$ between $K_{i+1}$ and $L_{i+1}$.
For the sake of contradiction, assume that $(w_k)_{k=1}^{B-i}$ is a $\preceq$-zigzag between $K_{i+1}$ and
$L_{i+1}$ of length $B - i$. We either have that $w_{B-i} \in K_{i+1}$ or $w_{B-i} \in L_{i+1}$. We
prove the case $w_{B-i} \in K_{i+1}$ since the other case is analogous. Here, we have that
$w_{B-i} \notin \mathrm{layer}(K_i, L_i)$. By definition of $\mathrm{layer}(K_i, L_i)$, there exists a $w \in L_i$ such that
$w_{B-i} \preceq w$. But this means that the sequence $w_1, \dots, w_{B-i}, w$ would be a $\preceq$-zigzag
between $K_i$ and $L_i$ of length $B - i + 1$, which is a contradiction.

We therefore have that, for every $i$, every $\preceq$-zigzag between $K_i$ and $L_i$ has
length at most $B - i$. In particular, this means that if $i \ge B$, every $\preceq$-zigzag
between languages $K_i$ and $L_i$ has length at most zero. Since any word $w \in K_i \cup L_i$
would already be a $\preceq$-zigzag of length one, this means that $K_i = L_i = \emptyset$.

We now show how the languages $K$ and $L$ can be layer-separated by $\preceq$-closed
languages. Denote by $S_{(X,Y)}$ the $\preceq$-closed language obtained when applying Claim 17
to languages $X$ and $Y$. Then, the sequence

$$S_{(K_0,L_0)},\; S_{(L_0,K_0)},\; S_{(K_1,L_1)},\; S_{(L_1,K_1)},\; \dots,\; S_{(K_{B-1},L_{B-1})},\; S_{(L_{B-1},K_{B-1})}$$

covering the layers

$$\mathrm{layer}(K_0, L_0),\; \mathrm{layer}(L_0, K_0),\; \dots,\; \mathrm{layer}(K_{B-1}, L_{B-1}),\; \mathrm{layer}(L_{B-1}, K_{B-1}),$$

respectively, layer-separates $K$ and $L$. Condition 1 of the definition of layered
separability is satisfied because all the languages covering layers with smaller numbers
appear earlier in the sequence. Condition 2 is true because the union of all the
considered layers includes $K \cup L$. □
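
The layer-peeling construction from the proof of Lemma 7 can be illustrated on finite languages. The sketch below is only an illustration under simplifying assumptions: $K$ and $L$ are finite sets of Python strings, the WQO is the subsequence order, and the names `is_subsequence`, `layer` and `peel_layers` are hypothetical. For finite languages an infinite zigzag exists exactly when both layers become empty while words remain, which the sketch reports by returning `None`.

```python
def is_subsequence(v, w):
    """Return True if v can be obtained from w by deleting symbols."""
    remaining = iter(w)
    return all(symbol in remaining for symbol in v)


def layer(X, Y):
    """Words of X that are not a subsequence of any word of Y."""
    return {w for w in X if not any(is_subsequence(w, y) for y in Y)}


def peel_layers(K, L):
    """Compute layer(K_i, L_i) and layer(L_i, K_i) for i = 0, 1, ...

    Returns the list of layer pairs, or None if no word can be peeled
    off although K_i or L_i is still nonempty.
    """
    K, L = set(K), set(L)
    result = []
    while K or L:
        layer_K, layer_L = layer(K, L), layer(L, K)
        if not layer_K and not layer_L:
            return None
        result.append((layer_K, layer_L))
        # K_{i+1} and L_{i+1} are both computed from K_i and L_i.
        K, L = K - layer_K, L - layer_L
    return result


print(peel_layers({"a", "abc"}, {"ab"}))
```

Each returned pair corresponds to one step $i$ of the construction; the upward closures of the two layers would give the languages $S_{(K_i,L_i)}$ and $S_{(L_i,K_i)}$ of Claim 17.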

Proofs of Section 4

Proof of Lemma 9 with Running Example

In this section we prove Lemma 9.

Lemma 9. There is an infinite zigzag between regular languages $L^A$ and $L^B$ if
and only if there exist synchronized languages $K^A \subseteq L^A$ and $K^B \subseteq L^B$.

To prove it, we need several auxiliary results showing that if there is an infinite
zigzag between two regular languages, then there is also an infinite zigzag between
their sublanguages of a special form.

To illustrate the proofs of this section we use a running example with regular
languages

$$L^A = a(b^*a)^*a(bb)^*abcabb(bc)^* + (ab^*c)^* + b^*c(cb)^*$$

and

$$L^B = abd + b(aab)^*baca(b(cb^*)^*c)^*cc(cbc)^*b + (aa)^* + ba(bb)^*$$

having an infinite zigzag between them. After each step we present how the
considered languages have been modified.

We say that language $K$ embeds into $L$, denoted by $K \preceq L$, if for every $v \in K$,
there exists a $w \in L$ such that $v \preceq w$. In order to be consistent we also say here
that word $v$ embeds into word $w$ if $\{v\}$ embeds in $\{w\}$, i.e., $v$ is a subsequence of $w$.
Languages $K^A$ and $K^B$ are mutually embeddable if $K^A \preceq K^B$ and $K^B \preceq K^A$. Note
that there always exists an infinite zigzag between nonempty mutually-embeddable
languages.
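
The notions of embedding and mutual embeddability, and the remark about zigzags, can be made concrete for finite languages. The following is a minimal sketch under the assumption that languages are finite sets of Python strings; the helper names `embeds`, `mutually_embeddable` and `zigzag_prefix` are hypothetical. It only builds a finite prefix of a zigzag, by repeatedly moving to a word of the other language into which the current word embeds.

```python
def is_subsequence(v, w):
    """Return True if word v is a subsequence of word w."""
    remaining = iter(w)
    return all(symbol in remaining for symbol in v)


def embeds(K, L):
    """K embeds into L: every word of K is a subsequence of a word of L."""
    return all(any(is_subsequence(v, w) for w in L) for v in K)


def mutually_embeddable(K, L):
    return embeds(K, L) and embeds(L, K)


def zigzag_prefix(K, L, start, length):
    """First `length` words of a zigzag starting at `start` in K.

    Assumes K and L are mutually embeddable, so a next word always exists.
    """
    words, current = [start], start
    target, other = L, K
    while len(words) < length:
        # Pick (deterministically) some word of the other language above current.
        current = min(w for w in target if is_subsequence(current, w))
        words.append(current)
        target, other = other, target
    return words


K, L = {"a", "ab"}, {"ab", "b"}
print(mutually_embeddable(K, L), zigzag_prefix(K, L, "a", 4))
```

Because the next word can always be chosen, the prefix can be extended to any length, which is the intuition behind the note above.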

Lemma 18 (Mutual embeddability). If there is an infinite zigzag between regular
languages $L^A$ and $L^B$, then there exist nonempty mutually-embeddable regular
languages $K^A \subseteq L^A$ and $K^B \subseteq L^B$.

Proof. We define languages $K^A$ and $K^B$ and show that they possess the required
properties. Let $I$ denote the set of all words that belong to any infinite zigzag
between languages $L^A$ and $L^B$, and let $K^A = L^A \cap I$ and $K^B = L^B \cap I$. Then, for
any $w \in K^A$, let $I_w$ denote an infinite zigzag containing $w$. As $I_w$ is infinite, there
exists $w' \in I_w \cap L^B$ such that $w \preceq w'$, hence $w' \in K^B$. Therefore $K^A \preceq K^B$. The
case $K^B \preceq K^A$ is analogous.

It remains to show that $K^A$ and $K^B$ are regular. We prove it for $K^A$ since the
case for $K^B$ is analogous. Let $M$ denote the set of all minimal words of $L^A \setminus K^A$,
that is, words $w \in L^A \setminus K^A$ such that there is no $w' \in L^A \setminus K^A$ with $w' \preceq w$ and
$w' \neq w$. Note that any distinct words $w$ and $w'$ of $M$ are incomparable, i.e., $w \not\preceq w'$
and $w' \not\preceq w$. By Higman's lemma, $M$ is finite. If $w \in L^A \setminus K^A$, that is, $w \notin I$, then
any $w' \in L^A$ with $w \preceq w'$ also belongs to $L^A \setminus K^A$; otherwise, $w' \in K^A = L^A \cap I$
implies that $w \in I$, which is a contradiction. Thus,

$$L^A \setminus K^A = L^A \cap \bigcup_{w \in M} \mathrm{closure}(w).$$

Notice that the language $\bigcup_{w \in M} \mathrm{closure}(w)$ is $\preceq$-closed, hence regular, and so is the
language $K^A = L^A \setminus (L^A \setminus K^A)$. □

By Lemma 18 our running example could be reduced to

$$L^A = a(b^*a)^*a(bb)^*abcabb(bc)^* + b^*c(cb)^*$$

and

$$L^B = b(aab)^*baca(b(cb^*)^*c)^*cc(cbc)^*b + (aa)^* + ba(bb)^*,$$

since words from $(ab^*c)^* \subseteq L^A$ and $abd \subseteq L^B$ do not belong to any infinite zigzag.

Now we strengthen the result by imposing a union-free decomposition on these
languages.

Lemma 19 (Union-free languages). If there is an infinite zigzag between regular
languages $L^A$ and $L^B$, then there exist nonempty mutually-embeddable union-free
regular languages $K^A \subseteq L^A$ and $K^B \subseteq L^B$.

Proof. By Lemma 18 there exist nonempty mutually-embeddable regular languages
$M^A \subseteq L^A$ and $M^B \subseteq L^B$. By '5T| , see also [1], every regular language can be
expressed as a finite union of union-free languages, hence we have

$$M^A = D_1^A \cup D_2^A \cup \dots \cup D_k^A \quad\text{and}\quad M^B = D_1^B \cup D_2^B \cup \dots \cup D_\ell^B,$$

where all languages $D_i^A$ and $D_j^B$ are union-free. It remains to show that there exist
$1 \le i \le k$ and $1 \le j \le \ell$ such that $D_i^A$ and $D_j^B$ are mutually embeddable. We first
show that for each $D_i^A$ there exists a $D_j^B$ such that $D_i^A \preceq D_j^B$. Consider a union-free
regular expression for $D_i^A$. For any $n \in \mathbb{N}$, we define $w_n$ as a word obtained from
the expression for $D_i^A$ by replacing stars with $n$. For the expression $(ab^*c)^*(cb)^*$ we
have $w_n = (ab^n c)^n (cb)^n$. Note that for every $w \in D_i^A$, there exists $n \in \mathbb{N}$ and a
word $w_n \in D_i^A$ such that $w \preceq w_n$. Number $n$ can be chosen as $n = |w|$ since $w_n$ is
in $D_i^A$ by definition.

Consider now the sequence $(w_n)_{n=1}^{\infty}$. Every word $w_n$ can be embedded to a word
of $M^B$, therefore to a word of $D_j^B$, for some $j$. Thus, there exists a $j_0$ such that
infinitely many words of the sequence $(w_n)_{n=1}^{\infty}$ embed to words of $D_{j_0}^B$. We claim
that $D_i^A \preceq D_{j_0}^B$. As mentioned above, for every $w \in D_i^A$, there exists an $n$ such
that $w \preceq w_n$. As there are infinitely many words $w_\ell$ embedding to $D_{j_0}^B$, there exists
$m > n$ such that $w_m$ embeds to $D_{j_0}^B$. Clearly, $w_n \preceq w_m$, thus $w \preceq w_n \preceq w_m$ and,
hence, $w$ embeds to $D_{j_0}^B$, which shows that $D_i^A \preceq D_{j_0}^B$.

Thus, there is a function $f : \{1, \dots, k\} \to \{1, \dots, \ell\}$ such that $D_i^A \preceq D_{f(i)}^B$,
and a function $g : \{1, \dots, \ell\} \to \{1, \dots, k\}$ such that $D_j^B \preceq D_{g(j)}^A$. Define the
function $h : \{1, \dots, k\} \to \{1, \dots, k\}$ by $h(i) = g(f(i))$. Considering the sequence
$h^1(1), h^2(1), \dots$ at some moment we encounter a repetition since all the values
come from a finite set. In other words there exist numbers $c, d \in \mathbb{N}$, $c < d$, such
that $h^c(1) = h^d(1) = i$. This means that $D_i^A \preceq D_{f(i)}^B \preceq D_i^A$. We assign $K^A = D_i^A$
and $K^B = D_{f(i)}^B$. □
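
The last step of the proof, finding a repetition in the sequence $h^1(1), h^2(1), \dots$, is a simple iteration. The sketch below assumes that the functions $f$ and $g$ are already known and given as Python dictionaries; the name `find_mutual_pair` is hypothetical.

```python
def find_mutual_pair(f, g, start=1):
    """Iterate h = g after f from `start` until a value repeats.

    f maps indices of the D^A components to indices of the D^B components
    and g maps back. Returns (i, f[i]) for the first repeated value i.
    """
    seen = set()
    i = g[f[start]]           # h^1(start)
    while i not in seen:
        seen.add(i)
        i = g[f[i]]           # next value of the sequence
    return i, f[i]


# Hypothetical example with k = 3 components on the A side, 2 on the B side.
f = {1: 1, 2: 2, 3: 2}
g = {1: 2, 2: 3}
print(find_mutual_pair(f, g))
```

The loop stops after at most $k + 1$ steps, since the values of $h$ come from a set of size $k$.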

By Lemma 19 we can simplify the example to

$$L^A = a(b^*a)^*a(bb)^*abcabb(bc)^* \quad\text{and}\quad L^B = b(aab)^*baca(b(cb^*)^*c)^*cc(cbc)^*b$$

by eliminating $b^*c(cb)^* \subseteq L^A$ and $(aa)^* + ba(bb)^* \subseteq L^B$.

We now consider a transformation $L \mapsto L'$ transforming a language to a simpler
one, which is still mutually embeddable with the original. By transitivity of
the mutual-embeddability relation, we may transform two mutually embeddable
languages and the results will also be mutually embeddable. The next four lemmas
describe this transformation.

Lemma 20 (Star depth one). For every union-free regular language $L$, there exists
a union-free regular language $L'$ such that:

• $L'$ is of star depth at most one, i.e., it has a regular expression of the form
$v_1(v_2)^*v_3 \dots (v_{2i})^*v_{2i+1}$, where all $v_j \in \Sigma^*$,

• $L$ and $L'$ are mutually embeddable,

• $L' \subseteq L$.

Proof. Consider a union-free regular expression $r$ such that $L(r) = L$. Then $r$ is of
the form

$$r = v_1(r_2)^*v_3(r_4)^*v_5 \cdots (r_{2i})^*v_{2i+1},$$

where $v_{2j+1} \in \Sigma^*$ and $r_{2j}$ is a (union-free) regular expression, for $j \le i$. Let $v_{2j}$ be
obtained from $r_{2j}$ by deleting all star operations. For example, if $r_{2j} = a(bcd^*b)^*$
then $v_{2j} = abcdb$. Clearly, $L(r') \subseteq L$. Our aim is to show that $L$ and $L(r')$ with

$$r' = v_1(v_2)^*v_3(v_4)^*v_5 \dots (v_{2i})^*v_{2i+1}$$

are mutually embeddable. Recall from the proof of Lemma 19 that for every word
$w \in L$ there exists $n$ such that $w \preceq w_n$ (where $w_n$ denotes the word obtained from $r$
by replacing all star operations with $n$). It is now sufficient to show that $w_n \preceq L(r')$
for every $n \ge 1$. To this end, fix an arbitrary $n \in \mathbb{N}$ and assume that

$$w_n = v_1 v'_2 v_3 \dots v'_{2i} v_{2i+1},$$

where $v'_{2j} \in L((r_{2j})^*)$, for $j \le i$. Note that each symbol occurring in $v'_{2j}$ also occurs in
$v_{2j}$. Thus, $v'_{2j} \preceq (v_{2j})^{|v'_{2j}|}$ and, therefore, $w_n \preceq L(r')$, which implies that $L(r) \preceq L(r')$.
Since $L(r') \subseteq L(r)$ implies that $L(r') \preceq L(r)$, the proof is complete. □
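
The transformation $r \mapsto r'$ of Lemma 20 only deletes the stars nested inside top-level loops. A minimal sketch follows, assuming a hypothetical encoding of union-free expressions as Python lists whose items are either plain words (strings) or pairs `("star", body)` with `body` again such a list; the names `delete_stars`, `to_star_depth_one` and `show` are illustrative only.

```python
def delete_stars(expr):
    """Concatenate all words of expr, dropping every star operation."""
    return "".join(
        part if isinstance(part, str) else delete_stars(part[1])
        for part in expr
    )


def to_star_depth_one(expr):
    """Keep top-level stars, but replace each loop body r_2j by the word v_2j."""
    return [
        part if isinstance(part, str) else ("star", [delete_stars(part[1])])
        for part in expr
    ]


def show(expr):
    """Render the encoded expression as a string such as 'a(ba)*b'."""
    return "".join(
        part if isinstance(part, str) else "(" + show(part[1]) + ")*"
        for part in expr
    )


# The component b(aab)*baca(b(cb*)*c)*cc(cbc)*b of the running example.
r_B = ["b", ("star", ["aab"]), "baca",
       ("star", ["b", ("star", ["c", ("star", ["b"])]), "c"]),
       "cc", ("star", ["cbc"]), "b"]
print(show(to_star_depth_one(r_B)))
```

On this input the inner loop body $b(cb^*)^*c$ collapses to the word $bcbc$, in the same way as $a(bcd^*b)^*$ collapses to $abcdb$ in the proof.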

Due to Lemma 20 we may eliminate the stars on depth more than one obtaining

$$L^A = a(ba)^*a(bb)^*abcabb(bc)^* \quad\text{and}\quad L^B = b(aab)^*baca(bcbc)^*cc(cbc)^*b.$$

Recall that every union-free language of star depth at most one is of the form
$v_1(v_2)^*v_3 \dots (v_{2i})^*v_{2i+1}$. We call the words $v_{2j}$ in the scope of a star operation loops.
A loop $v_{2j}$ with $\mathrm{Alph}(v_{2j}) = \Sigma_0 \subseteq \Sigma$ is called a $\Sigma_0$-loop.

Lemma 21 (Eliminating a loop). Let $L$ be a regular language of the form $L =
v_1(v_2)^*v_3 \cdots (v_{2i})^*v_{2i+1}$ and assume that for some $1 \le k, \ell \le i$, $k \neq \ell$,

• $\mathrm{Alph}(v_{2\ell}) \subseteq \mathrm{Alph}(v_{2k})$, and

• $\mathrm{Alph}(v_j) \subseteq \mathrm{Alph}(v_{2k})$ for all $\min(2k, 2\ell) < j < \max(2k, 2\ell)$.

Then the languages $L$ and $L' = v_1(v_2)^*v_3 \dots v_{2\ell-1}v_{2\ell+1} \cdots (v_{2i})^*v_{2i+1}$ obtained from
$L$ by eliminating the $v_{2\ell}$ loop are mutually embeddable.

Proof. We can assume that $k < \ell$. Indeed, if the lemma holds for $k < \ell$ we can
immediately infer that it also holds for $k > \ell$ because $K$ and $L$ are mutually
embeddable if and only if the reversed languages $K^{rev}$ and $L^{rev}$ are mutually
embeddable. As $L' \subseteq L$ we have that $L' \preceq L$. It remains to show that $L$ embeds to
$L'$. Fix an arbitrary word $w \in L$ and assume that $w = v_1 v'_2 v_3 \dots v'_{2i} v_{2i+1}$ where
$v'_{2j} \in (v_{2j})^*$, for $j \le i$. Note that every symbol occurring in $v'_{2k} v_{2k+1} \cdots v_{2\ell-1} v'_{2\ell}$
belongs to $\mathrm{Alph}(v_{2k})$. Then $v'_{2k} v_{2k+1} \cdots v_{2\ell-1} v'_{2\ell}$ embeds to $(v_{2k})^{|w|}$, which implies
that $w \preceq v_1 v'_2 \dots v_{2k-1} (v_{2k})^{|w|} v_{2\ell+1} \cdots v'_{2i} v_{2i+1}$ and, therefore, also

$$w \preceq v_1 v'_2 \dots v_{2k-1} (v_{2k})^{|w|} v_{2k+1} \cdots v_{2\ell-1} v_{2\ell+1} \cdots v'_{2i} v_{2i+1} \in L',$$

which completes the proof. □
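
The two alphabet conditions of Lemma 21 are easy to check mechanically. The following sketch assumes a hypothetical encoding of a star-depth-one expression $v_1(v_2)^*v_3 \dots (v_{2i})^*v_{2i+1}$ as a Python list `parts` of odd length, in which the entries at odd 0-based positions are the loops; the function names are illustrative. It repeatedly removes a loop $v_{2\ell}$ for which some other loop $v_{2k}$ satisfies the conditions.

```python
def find_redundant_loop(parts):
    """Return the position of a loop that Lemma 21 allows us to eliminate."""
    loop_positions = range(1, len(parts), 2)
    for k in loop_positions:
        big = set(parts[k])                      # Alph(v_2k)
        for l in loop_positions:
            if k == l:
                continue
            lo, hi = min(k, l), max(k, l)
            between_ok = all(set(parts[j]) <= big for j in range(lo + 1, hi))
            if set(parts[l]) <= big and between_ok:
                return l
    return None


def eliminate_redundant_loops(parts):
    """Remove redundant loops until the expression is nonredundant."""
    parts = list(parts)
    l = find_redundant_loop(parts)
    while l is not None:
        # Drop the loop and merge its two neighbouring words into one.
        parts[l - 1:l + 2] = [parts[l - 1] + parts[l + 1]]
        l = find_redundant_loop(parts)
    return parts


# a(ba)*a(bb)*abcabb(bc)* from the running example.
print(eliminate_redundant_loops(["a", "ba", "a", "bb", "abcabb", "bc", ""]))
```

When two loops are redundant with respect to each other, the sketch removes whichever it finds first, so its result may differ from a particular choice made by hand while still being a valid application of the lemma.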



Using Lemma 21 we may eliminate unnecessary loops $(bb)^*$ in $L^A$ and $(bcbc)^*$
(or, alternatively, $(cbc)^*$) in $L^B$ obtaining

$$L^A = a(ba)^*aabcabb(bc)^* \quad\text{and}\quad L^B = b(aab)^*bacacc(cbc)^*b.$$

Note that these are the languages from Example 8.

We call a union-free regular expression of star depth at most one with expressions
$v_k$ and $v_\ell$ as mentioned in Lemma 21 redundant since, intuitively, it has a redundant
loop. A union-free regular expression of star depth at most one that is not redundant
is called nonredundant. We use the same notions for the corresponding languages.
In what follows, when we speak about a regular expression of a nonredundant or
redundant language, we mean the corresponding nonredundant or redundant regular
expression, respectively. A nonredundant regular expression of the form

$$v_1(v_2)^*v_3(v_4)^* \dots (v_{2k})^*v_{2k+1},$$

where $v_i \in \Sigma^*$, for $1 \le i \le 2k + 1$, is called saturated if for any two loops $v_m$ and $v_n$
all symbols from $\mathrm{Alph}(v_m) \cup \mathrm{Alph}(v_n)$ occur in between. The language of a saturated
expression is called saturated. The intuition behind this notion is explained below.
The following lemma shows that we may assume that our languages are saturated.

Lemma 22 (Unfolding loops). Let $L$ be a nonredundant language. Then there exists
a saturated language $L' \subseteq L$ such that $L$ and $L'$ are mutually embeddable.

Proof. Let the regular expression of $L$ be $r = v_1(v_2)^*v_3(v_4)^* \dots (v_{2k})^*v_{2k+1}$ and
define $r' = v_1 v_2(v_2)^*v_2 v_3 v_4(v_4)^*v_4 \dots v_{2k}(v_{2k})^*v_{2k}v_{2k+1}$, where all the loops are
unfolded once in every direction and the corresponding language is $L' = L(r')$. The
nonredundancy of $r'$ is clear. Indeed, if there are two loops $v_i$ and $v_j$ in $r'$ such that
one of them has a bigger alphabet and every symbol in between $v_i$ and $v_j$ belongs to
this alphabet, then the situation also takes place before the unfolding of the loops,
in the regular expression $r$. Furthermore, it is easy to see that $L'$ is saturated. It
is thus sufficient to show the mutual embeddability. Note that $L(r') \subseteq L(r)$, hence
$L(r') \preceq L(r)$. On the other hand, every word from $L(r)$ either belongs to $L(r')$ or
embeds to a word of $L(r')$ obtained by unfolding some loops several times. □
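
The unfolding $r \mapsto r'$ from the proof of Lemma 22 copies each loop once to its left and once to its right. Below is a minimal sketch, again assuming the hypothetical list encoding in which a star-depth-one expression $v_1(v_2)^*v_3 \dots (v_{2k})^*v_{2k+1}$ is a Python list of odd length with loops at odd 0-based positions; `unfold_loops` and `show` are illustrative names.

```python
def unfold_loops(parts):
    """Unfold every loop once in each direction."""
    unfolded = []
    for position, word in enumerate(parts):
        if position % 2 == 1:
            unfolded.append(word)          # the loop itself is kept
        else:
            # A fixed word absorbs one copy of each neighbouring loop.
            before = parts[position - 1] if position > 0 else ""
            after = parts[position + 1] if position + 1 < len(parts) else ""
            unfolded.append(before + word + after)
    return unfolded


def show(parts):
    """Render the encoded expression, e.g. ['a', 'ba', ''] as 'a(ba)*'."""
    return "".join(
        word if position % 2 == 0 else "(" + word + ")*"
        for position, word in enumerate(parts)
    )


# a(ba)*aabcabb(bc)* from the running example.
print(show(unfold_loops(["a", "ba", "aabcabb", "bc", ""])))
```

Since every fixed word only grows by copies of adjacent loops, each word of the unfolded language is also a word of the original one, in line with $L(r') \subseteq L(r)$.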

After unfolding the loops $(ba)^*$ and $(bc)^*$ in $L^A$ and $(aab)^*$ and $(cbc)^*$ in $L^B$ we
obtain

$$L^A = aba(ba)^*baaabcabbbc(bc)^*bc$$

and

$$L^B = baab(aab)^*aabbacacccbc(cbc)^*cbcb.$$

In fact languages $L^A$ and $L^B$ were already saturated, but this is not always true in
general for nonredundant languages.

The $\Sigma_0$-decomposition of a saturated regular expression $r$ is of the form

$$r_1\, u_1(v_1)^*w_1\; r_2\, u_2(v_2)^*w_2 \cdots r_k\, u_k(v_k)^*w_k\; r_{k+1},$$

where words $v_1, v_2, \dots, v_k$ are $\Sigma_0$-loops in $r$, words $u_i$ and $w_i$ satisfy
$\mathrm{Alph}(u_i) \cup \mathrm{Alph}(w_i) \subseteq \Sigma_0$, for $1 \le i \le k$, and $r_1, r_2, \dots, r_{k+1}$ are nonredundant expressions
without $\Sigma_0$-loops starting and ending with symbols not belonging to $\Sigma_0$.

Notice that the $\Sigma_0$-decomposition may not exist for non-saturated expressions.
Consider for instance the expression $(ab)^*a(ac)^*$, and try to compute its
$\{a, b\}$-decomposition. It does not exist, as, intuitively, there is no symbol outside $\{a, b\}$
between the $\{a, b\}$-loop and the $\{a, c\}$-loop. Thus it is not possible to start an expression
$r_2$ by a symbol not belonging to $\{a, b\}$, as required above. This is the reason why
we need to make it saturated, for example by unfolding the loops like in the proof
of Lemma 22. Then we obtain the expression $ab(ab)^*abaac(ac)^*ac$, which has the
$\{a, b\}$-decomposition of the form $r_1 = \varepsilon$, $u_1 = ab$, $v_1 = ab$, $w_1 = abaa$ and
$r_2 = c(ac)^*ac$ starting with a symbol outside $\{a, b\}$, as needed.

For two saturated regular expressions $r^A$ and $r^B$ we say that an alphabet $\Sigma_0 \subseteq \Sigma$
is $(r^A, r^B)$-loop-maximal if

1. there exists a $\Sigma_0$-loop either in $r^A$ or in $r^B$; and

2. there is no $\Sigma' \supsetneq \Sigma_0$ for which a $\Sigma'$-loop occurs either in $r^A$ or in $r^B$.

If $r^A$ and $r^B$ are clear from the context we simply say that an alphabet $\Sigma_0 \subseteq \Sigma$ is
loop-maximal.

Lemma 23 (Decompositions). Let $L^A$ and $L^B$ be two saturated and
mutually-embeddable languages with $r^A$ and $r^B$ being their saturated regular expressions. Let
$\Sigma_0 \subseteq \Sigma$ be loop-maximal. Let the $\Sigma_0$-decomposition of $r^A$ be

$$r^A = r_1^A\, u_1^A(v_1^A)^*w_1^A\; r_2^A\, u_2^A(v_2^A)^*w_2^A \dots r_k^A\, u_k^A(v_k^A)^*w_k^A\; r_{k+1}^A .$$

Then the numbers of $\Sigma_0$-loops in $r^A$ and $r^B$ coincide. Moreover, the
$\Sigma_0$-decomposition of $r^B$ is

$$r^B = r_1^B\, u_1^B(v_1^B)^*w_1^B\; r_2^B\, u_2^B(v_2^B)^*w_2^B \dots r_k^B\, u_k^B(v_k^B)^*w_k^B\; r_{k+1}^B ,$$

where, for all $1 \le i \le k + 1$, the languages $L(r_i^A)$ and $L(r_i^B)$ are mutually embeddable
and saturated.
Proof. If $v = a_1 \cdots a_k$ embeds into $w = b_1 \cdots b_\ell$ such that $v = b_{i_1} \cdots b_{i_k}$, for
$i_1 < \dots < i_k$, then we say that symbol $a_j$ embeds into the position $i_j$ with respect to this
embedding. Usually, if the embedding is clear from the context, we omit it.

We first show that both $r^A$ and $r^B$ have the same number of $\Sigma_0$-loops. For the
sake of contradiction, assume that there are more $\Sigma_0$-loops in $r^A$ than in $r^B$. We
will exploit the fact that $L^A \preceq L^B$. Let $m$ be the size of $r^B$, i.e., the number of
symbols in it, and consider an arbitrary word

$$v = s_1^A u_1^A (v_1^A)^{m+1} w_1^A\; s_2^A u_2^A (v_2^A)^{m+1} w_2^A \cdots s_k^A u_k^A (v_k^A)^{m+1} w_k^A\; s_{k+1}^A \in L^A,$$

where $s_i^A \in L(r_i^A)$, for $i \le k + 1$. There is a word $w \in L^B$ such that $v \preceq w$. Consider
an arbitrary $v_j^A$, for $1 \le j \le k$. There are at least $m + 1$ occurrences of $v_j^A$ in $v$ and
for each one the last symbol of $v_j^A$ coincides with a symbol of $r^B$. As there are $m + 1$
words $v_j^A$ there are also $m + 1$ positions in $r^B$ in which their first symbol embeds.
By the pigeonhole principle, at least two of them coincide in $r^B$. Recall that there
is no $\Sigma_0'$-loop for $\Sigma_0' \supsetneq \Sigma_0$ in $r^B$. Thus some repeated position $x$ in $r^B$ has to be
inside some $\Sigma_0$-loop; otherwise, it would not be possible to read several words $v_j^A$
and after this end up in the same position in $r^B$. Therefore we define a mapping
from $\Sigma_0$-loops in $r^A$ to $\Sigma_0$-loops in $r^B$, which maps a loop from $r^A$ to some loop in
$r^B$ in which the above discussed repeated position occurs. Note that there possibly
could be more than one such loop in $r^B$; then we pick one of them.

We will show that no $\Sigma_0$-loop in $r^B$ is assigned to two different $\Sigma_0$-loops $v_i^A$ and
$v_j^A$ from $r^A$. Assume, to the contrary, that both $v_i^A$ and $v_j^A$, for $i < j$, are mapped
to the same $\Sigma_0$-loop $v_\ell^B$ in $r^B$. Thus every symbol in between $v_i^A$ and $v_j^A$ has to
embed in some position in the loop $v_\ell^B$. However, recall that there exists a symbol
$a \notin \Sigma_0$ in between the loops $v_i^A$ and $v_j^A$, while loop $v_\ell^B$ contains only symbols
from $\Sigma_0$. This leads to a contradiction. Therefore, in particular, there are not
more $\Sigma_0$-loops in $r^A$ than in $r^B$.

Thus, we may assume that the $\Sigma_0$-decomposition of $r^B$ is of the form

$$r^B = r_1^B\, u_1^B(v_1^B)^*w_1^B\; r_2^B\, u_2^B(v_2^B)^*w_2^B \dots r_k^B\, u_k^B(v_k^B)^*w_k^B\; r_{k+1}^B .$$

By definition of the $\Sigma_0$-decomposition all $r_i^A$ and $r_i^B$ are nonredundant. It remains
to show that the languages $L(r_i^A)$ and $L(r_i^B)$ are mutually embeddable. Fix an index
$i$. We show that $L(r_i^A) \preceq L(r_i^B)$ since the other direction is analogous. Assume,
to the contrary, that a word $u \in L(r_i^A)$ does not embed to $L(r_i^B)$. Note that the
word $v$ above was chosen arbitrarily, with the only restriction that $\Sigma_0$-loops were
repeated $m + 1$ times each. Thus, put $s_i^A = u$ and consider the position in word
$w$ where the last symbol of $u$ could embed, inspecting $w$ from left to right. As
shown above, $u$ cannot embed earlier than in $v_{i-1}^B$. Recall that the last symbol of
$s_i^A$ does not belong to $\Sigma_0$, thus it does not embed to $v_{i-1}^B$ and $w_{i-1}^B$. As $u \not\preceq L(r_i^B)$,
the last symbol of $u$ does not embed to the infix of $w$ corresponding to $r_i^B$. One
more time, as the last symbol of $s_i^A$ does not belong to $\Sigma_0$ it does not embed to
$u_i^B(v_i^B)^*w_i^B$. Thus, the first position where it could embed is somewhere in $r_{i+1}^B$.
Then, however, we have to assign the $\Sigma_0$-loops of $r^A$ that follow $s_i^A$ to the
$\Sigma_0$-loops of $r^B$ that follow this position in (as shown above) an injective way,
which is not possible. □



Proof of Lemma 9. It is easy to see that if the languages $K^A$ and $K^B$ are nonempty
and synchronized then there exists an infinite zigzag between them, thus also
between languages $L^A$ and $L^B$.

To prove the opposite implication, assume that there exists an infinite zigzag
between the languages $L^A$ and $L^B$. Applying Lemma 19 first we obtain nonempty
union-free mutually-embeddable languages $M^A \subseteq L^A$ and $M^B \subseteq L^B$. Then, using
Lemma 20, several times Lemma 21 and, finally, Lemma 22 we obtain languages
$K^A$ and $K^B$ represented by saturated (thus also union-free, of star depth one and
nonredundant) regular expressions that are mutually embeddable to the languages
$M^A$ and $M^B$, respectively. As the mutual-embeddability relation is transitive, $K^A$
and $K^B$ are mutually embeddable. Note that $K^A \subseteq M^A$ and $K^B \subseteq M^B$ as the
application of Lemmas 20, 21 and 22 results in sublanguages of the original languages.
To complete the proof, we show that they are synchronized.

Consider the regular expressions $r^A$ and $r^B$ (with the properties listed above)
for $K^A$ and $K^B$, respectively, and denote the number of loops in $r^A$ by $i_A$ and in
$r^B$ by $i_B$. We prove the rest of the lemma by induction on $i_A + i_B$. For $i_A + i_B = 0$,
$i_A = i_B = 0$ and $K^A = \{w_1\}$ and $K^B = \{w_2\}$, for some $w_1, w_2 \in \Sigma^*$. As there exists
an infinite zigzag between $K^A$ and $K^B$, we have $w_1 = w_2$ and, hence, $K^A$ and $K^B$
are synchronized in one step. Note that this is the place where we use that the
languages are not necessarily disjoint.

Assume that $i_A + i_B = k > 0$. Fix an alphabet $\Sigma_0$ which is $(r^A, r^B)$-loop-maximal.
Then, by Lemma 23 we obtain that the $\Sigma_0$-decompositions of $r^A$ and
$r^B$ are

$$r^A = s_1^A\, u_1^A(v_1^A)^*w_1^A\; s_2^A\, u_2^A(v_2^A)^*w_2^A \dots s_k^A\, u_k^A(v_k^A)^*w_k^A\; s_{k+1}^A$$

and

$$r^B = s_1^B\, u_1^B(v_1^B)^*w_1^B\; s_2^B\, u_2^B(v_2^B)^*w_2^B \dots s_k^B\, u_k^B(v_k^B)^*w_k^B\; s_{k+1}^B ,$$

where the languages $L(s_i^A)$ and $L(s_i^B)$ are mutually embeddable and saturated for all
$1 \le i \le k + 1$. Thus, by induction hypothesis, all $L(s_i^A)$ and $L(s_i^B)$ are synchronized.
As, by definition, $u_i^A(v_i^A)^*w_i^A$ and $u_i^B(v_i^B)^*w_i^B$ are synchronized in one step, for all
$1 \le i \le k$, we have that $r^A$ and $r^B$ are synchronized, which completes the proof. □



Remaining Proofs of Section 4



Lemma 10. For two NFAs A and B, the following conditions are equivalent.

1. Automata A and B are synchronizable.

2. There exist synchronized languages $K^A \subseteq L(A)$ and $K^B \subseteq L(B)$.

Proof. The implication from left to right is immediate. To prove the opposite
implication, let $K^A = D_1^A \cdots D_k^A$ and $K^B = D_1^B \cdots D_k^B$, where
$D_i^A$ and $D_i^B$ are synchronized in one step, for all $1 \le i \le k$. Define
the $n$-th canonical word of a singleton language as its unique word, and of a cycle
language $w_{pref} (w_{mid})^* w_{suff}$ as the word $w_{pref} (w_{mid})^n w_{suff}$.
Let $N$ be the maximum number of states of automata A and B, and let $w_i^A$ and
$w_i^B$ be the $N$-th canonical words of languages $D_i^A$ and $D_i^B$,
respectively, for $1 \le i \le k$. Let $w^A = w_1^A \cdots w_k^A$ and
$w^B = w_1^B \cdots w_k^B$. Notice that $w^A \in K^A \subseteq L(A)$ and
$w^B \in K^B \subseteq L(B)$. Consider some accepting runs of A on $w^A$ and of B
on $w^B$, respectively,

$$q_1^A \xrightarrow{w_1^A} q_2^A \xrightarrow{w_2^A} \cdots \xrightarrow{w_{k-1}^A} q_k^A \xrightarrow{w_k^A} q_{k+1}^A$$

and

$$q_1^B \xrightarrow{w_1^B} q_2^B \xrightarrow{w_2^B} \cdots \xrightarrow{w_{k-1}^B} q_k^B \xrightarrow{w_k^B} q_{k+1}^B.$$

By definition of a run, states $q_1^A$ and $q_1^B$ are initial in A and B,
respectively, and states $q_{k+1}^A$ and $q_{k+1}^B$ are accepting in A and B,
respectively. Thus, to show that A and B are synchronizable, it is sufficient to
show that the pairs $(q_i^A, q_i^B)$ and $(q_{i+1}^A, q_{i+1}^B)$ are
synchronizable, for all $1 \le i \le k$. Fix some $1 \le i \le k$; then there are
two cases. Either both $D_i^A$ and $D_i^B$ are singletons, or they are cycle
languages. Consider first the situation when they are singletons. Then we have
$w_i^A = w_i^B \in D_i^A = D_i^B$ and the pairs $(q_i^A, q_i^B)$ and
$(q_{i+1}^A, q_{i+1}^B)$ are clearly synchronizable.

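Focus now on the situation where $D_i^A$ and $D_i^B$ are cycle languages. In this
case,

$$D_i^A = v_{pref}^A (v_{mid}^A)^* v_{suff}^A \quad\text{and}\quad D_i^B = v_{pref}^B (v_{mid}^B)^* v_{suff}^B$$

for some $v_{pref}^A, v_{mid}^A, v_{suff}^A, v_{pref}^B, v_{mid}^B, v_{suff}^B \in \Sigma^*$, and

$$w_i^A = v_{pref}^A (v_{mid}^A)^N v_{suff}^A \quad\text{and}\quad w_i^B = v_{pref}^B (v_{mid}^B)^N v_{suff}^B.$$

Consider a run of A on $w_i^A$ from $q_i^A$ to $q_{i+1}^A$. It is of the form

$$q_i^A \xrightarrow{v_{pref}^A} m_0^A \xrightarrow{v_{mid}^A} m_1^A \xrightarrow{v_{mid}^A} \cdots \xrightarrow{v_{mid}^A} m_{N-1}^A \xrightarrow{v_{mid}^A} m_N^A \xrightarrow{v_{suff}^A} q_{i+1}^A,$$

for some states $m_j^A$, for $0 \le j \le N$. Notice that at least two among the
states $m_0^A, \ldots, m_N^A$ necessarily coincide, as automaton A has no more than
$N$ states. Assume thus that for some $0 \le k < \ell \le N$ we have
$m_k^A = m_\ell^A = m^A$. Then

$$q_i^A \xrightarrow{v_{pref}^A (v_{mid}^A)^{k}} m^A \xrightarrow{(v_{mid}^A)^{\ell-k}} m^A \xrightarrow{(v_{mid}^A)^{N-\ell} v_{suff}^A} q_{i+1}^A,$$

which shows that states $q_i^A$ and $q_{i+1}^A$ are Alph($v_{mid}^A$)-connected in
A, since we have Alph($v_{pref}^A$) $\cup$ Alph($v_{suff}^A$) $\subseteq$
Alph($v_{mid}^A$) by definition of languages synchronized in one step. Similarly we
can show that $q_i^B$ and $q_{i+1}^B$ are Alph($v_{mid}^B$)-connected in B.
However, by definition of synchronization in one step, the cycle alphabets of
$D_i^A$ and $D_i^B$ are the same, so Alph($v_{mid}^A$) = Alph($v_{mid}^B$). This
shows that the pairs $(q_i^A, q_i^B)$ and $(q_{i+1}^A, q_{i+1}^B)$ are
synchronizable and completes the proof. □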



Theorem 11. Let A and B be two NFAs. Then the languages L(A) and L(B) are
separable by piecewise testable languages if and only if the automata A and B are
not synchronizable.

Proof. This theorem follows from the previous results. Namely, by Theorem 3, the
languages L(A) and L(B) are separable by piecewise testable languages if and only
if there is no infinite zigzag between them. Lemma 9 shows that the existence of
an infinite zigzag is equivalent to the existence of two synchronized sublanguages
$K^A \subseteq L(A)$ and $K^B \subseteq L(B)$. Finally, by Lemma 10, the existence
of two synchronized sublanguages is equivalent to the fact that the automata A and
B are synchronizable, which concludes the proof. □



Theorem 12. Given two NFAs A and B, it is possible to test in polynomial time
whether L(A) and L(B) can be separated by a piecewise testable language.
Proof. By Theorem 11, it is enough to check whether A and B are synchronizable.
Let $A = (Q^A, \Sigma, \delta^A, q_0^A, F^A)$ and
$B = (Q^B, \Sigma, \delta^B, q_0^B, F^B)$. We will consider the graph SYNCH whose
vertices are pairs of states of $Q^A \times Q^B$ and whose edges correspond to pairs
of vertices synchronizable in one step. Specifically, there is an edge
$(p^A, p^B) \to (q^A, q^B)$ in SYNCH if and only if $(p^A, p^B)$ and $(q^A, q^B)$
are synchronizable in one step. Thus, A and B are synchronizable if and only if a
vertex consisting of accepting states is reachable in SYNCH from the pair of initial
states $(q_0^A, q_0^B)$. Since reachability is testable in PTIME, it is sufficient
to show how we compute the edges of SYNCH.

The definition of synchronizability in one step consists of two cases.
We refer to the first case as symbol synchronization and to the second case as cycle
synchronization.

For two symbol-synchronizable pairs of states $(p^A, p^B)$, $(q^A, q^B)$, there
should be an edge $(p^A, p^B) \to (q^A, q^B)$ in SYNCH if there exists an
$a \in \Sigma$ such that $p^A \xrightarrow{a} q^A$ in A and
$p^B \xrightarrow{a} q^B$ in B. Since all these pairs can be found in polynomial
time by inspecting the transitions of A and B, these edges in SYNCH can be easily
constructed.
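As an illustration of this step, the following Python sketch enumerates the
symbol-synchronization edges. The encoding of a transition relation as a collection
of (source, symbol, target) triples and the name `symbol_edges` are assumptions made
for this example only; they are not notation from the paper.

```python
from collections import defaultdict


def symbol_edges(trans_a, trans_b):
    """Return the SYNCH edges ((pA, pB), (qA, qB)) induced by a shared symbol.

    trans_a, trans_b: iterables of (source, symbol, target) triples, a
    hypothetical encoding of the transition relations of A and B.
    """
    # Index the transitions of B by their symbol.
    by_symbol_b = defaultdict(list)
    for p, a, q in trans_b:
        by_symbol_b[a].append((p, q))

    edges = set()
    for p_a, a, q_a in trans_a:
        # Pair every a-transition of A with every a-transition of B.
        for p_b, q_b in by_symbol_b.get(a, ()):
            edges.add(((p_a, p_b), (q_a, q_b)))
    return edges
```

The number of produced edges is bounded by the product of the numbers of
transitions of A and B, so this part of the construction is polynomial.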

We now show how to construct the edges for cycle-synchronized states. For
two pairs $(p^A, p^B)$ and $(q^A, q^B)$ to be cycle-synchronizable, we require that
$p^A$ and $q^A$ are $\Sigma_0$-connected in A, and $p^B$ and $q^B$ are
$\Sigma_0$-connected in B, for (the same) $\Sigma_0 \subseteq \Sigma$. We now
rephrase this definition using other notions that will be useful in the algorithm.

A pair of states $(p^A, p^B) \in Q^A \times Q^B$ has a saturated $\Sigma_0$-cycle
if there exist two words $v^A, v^B$ satisfying

1. Alph($v^A$) = Alph($v^B$) = $\Sigma_0$;

2. $p^A \xrightarrow{v^A} p^A$ in A; and

3. $p^B \xrightarrow{v^B} p^B$ in B.

We say that there is a $\Sigma_0$-route from $(p^A, p^B) \in Q^A \times Q^B$ to
$(q^A, q^B) \in Q^A \times Q^B$ if there exist words $u^A$ and $u^B$ in
$\Sigma_0^*$ such that $p^A \xrightarrow{u^A} q^A$ in A and
$p^B \xrightarrow{u^B} q^B$ in B. So, in contrast to saturated $\Sigma_0$-cycles,
here we do not require that the alphabets Alph($u^A$) and Alph($u^B$) are equal to
$\Sigma_0$.

Note that if a pair $V = (q^A, q^B)$ has a saturated $\Sigma_0$-cycle and a
saturated $\Sigma_1$-cycle, then it also has a saturated
$(\Sigma_0 \cup \Sigma_1)$-cycle (obtained by the concatenation of the two cycles).
Thus, for every pair $V = (q^A, q^B)$, there exists a unique maximal alphabet
$\Sigma_0 \subseteq \Sigma$ such that it has a saturated $\Sigma_0$-cycle. (This
unique maximal alphabet can be empty if no such saturated cycle exists.) We call
this alphabet the saturated cycle alphabet of $V$ and denote it by $\Sigma_0^V$.
This means that $V_p = (p^A, p^B)$ and $V_q = (q^A, q^B)$ are cycle synchronizable
if and only if there is a $V$ such that there are $\Sigma_0^V$-routes from $V_p$ to
$V$ and from $V$ to $V_q$.

To find all the cycle synchronizable pairs we can first compute, for every
$V = (p^A, p^B)$, the saturated cycle alphabet $\Sigma_0^V$. This can be done in
polynomial time in the following manner. Let $C_0^A$ and $C_0^B$ be the strongly
connected components of A and B containing $p^A$ and $p^B$, respectively. For a
strongly connected component $C$, let Alph($C$) be the set of all symbols $a$ of
$\Sigma$ that label transitions of the form $p \xrightarrow{a} q$, where both $p$
and $q$ belong to $C$. If Alph($C_0^A$) = Alph($C_0^B$), then $\Sigma_0^V$ equals
Alph($C_0^A$). Otherwise, set $\Sigma_1$ = Alph($C_0^A$) $\cap$ Alph($C_0^B$) and
consider automata $A_1$ and $B_1$ obtained from A and B by removing all transitions
labeled by symbols from $\Sigma \setminus \Sigma_1$. Consider the strongly
connected components $C_1^A$ and $C_1^B$ of $A_1$ and $B_1$ containing $p^A$ and
$p^B$, respectively, and proceed in the same way as before. Continuing this
procedure we obtain a sequence of strictly decreasing alphabets
$\Sigma_1 \supsetneq \Sigma_2 \supsetneq \cdots$; hence we perform at most
$|\Sigma|$ iterations. If we arrive at the empty alphabet then we set
$\Sigma_0^V = \emptyset$.
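The iteration just described can be sketched in Python as follows. This is only a
minimal illustration under stated assumptions: transition relations are given as
sets (or lists) of (source, symbol, target) triples, and the helper names
`_reach`, `scc_alphabet` and `saturated_cycle_alphabet` are hypothetical. The
strongly connected component of a state is computed naively as the intersection of
its forward-reachable and backward-reachable sets, which is polynomial but not
optimized.

```python
def _reach(adj, start):
    """Set of states reachable from `start` in the adjacency map `adj`."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for t in adj.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def scc_alphabet(trans, state, allowed):
    """Alph(C) for the SCC C of `state`, keeping only symbols in `allowed`."""
    kept = [(p, a, q) for p, a, q in trans if a in allowed]
    fwd, bwd = {}, {}
    for p, _, q in kept:
        fwd.setdefault(p, set()).add(q)
        bwd.setdefault(q, set()).add(p)
    component = _reach(fwd, state) & _reach(bwd, state)
    return {a for p, a, q in kept if p in component and q in component}


def saturated_cycle_alphabet(trans_a, trans_b, p_a, p_b):
    """Shrink the alphabet until the two SCC alphabets coincide."""
    sigma = {a for _, a, _ in trans_a} | {a for _, a, _ in trans_b}
    while True:
        alph_a = scc_alphabet(trans_a, p_a, sigma)
        alph_b = scc_alphabet(trans_b, p_b, sigma)
        if alph_a == alph_b:
            return alph_a  # possibly the empty set
        # Both alphabets are contained in sigma and differ, so their
        # intersection is strictly smaller than sigma: the loop terminates.
        sigma = alph_a & alph_b
```

For instance, if A has a two-state cycle reading a then b and B has a single
a-loop, the sketch first restricts to {a}, finds that A no longer has a cycle
through its state, and returns the empty alphabet.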

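We argue that we compute $\Sigma_0^V$ correctly. Clearly, if the algorithm returns
a set $\Sigma'$, then $\Sigma' \subseteq \Sigma_0^V$. Conversely, we have that
$\Sigma_0^V \subseteq \Sigma'$ because, at each point in the algorithm, the
alphabet under consideration contains $\Sigma_0^V$. (In the first iteration,
$\Sigma_0^V \subseteq$ Alph($C_0^A$) and $\Sigma_0^V \subseteq$ Alph($C_0^B$).
Furthermore, at each iteration $i$, if $\Sigma_0^V \subseteq$ Alph($C_i^A$) and
$\Sigma_0^V \subseteq$ Alph($C_i^B$), then $\Sigma_0^V \subseteq$
(Alph($C_i^A$) $\cap$ Alph($C_i^B$)).)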

Once we know, for each pair $V = (q^A, q^B)$, its saturated cycle alphabet
$\Sigma_0^V$, we can find all vertices $V_p$ such that there is a
$\Sigma_0^V$-route from $V_p$ to $V$ and all vertices $V_q$ such that there is a
$\Sigma_0^V$-route from $V$ to $V_q$, and add edges $V_p \to V_q$ to the graph
SYNCH. This concludes the construction of SYNCH and the presentation of the
algorithm. We note that our algorithm is clearly not yet time-optimal. □



Proofs of Section 5

The goal is to prove the following Theorem. 



Theorem 13. For $O \in \{\preceq, \preceq_s\}$ and $C$ being one of single, unions,
or boolean combinations, we have that the complexity of the separation problem by
$\mathcal{F}(O, C)$ is as indicated in the overview table.



The Subsequence Order Cases 

Lemma 24. The separation problem by $\mathcal{F}(\preceq, \text{single})$ is NP-complete.

Proof. Let $K$ and $L$ be two regular languages over $\Sigma$ given by NFAs. The
problem is to find a word $w$ in $\Sigma^*$ such that
$K \subseteq \mathrm{closure}^{\preceq}(w)$ and
$L \cap \mathrm{closure}^{\preceq}(w) = \emptyset$. By definition of the
subsequence order $\preceq$, the length of such a word $w$ is at most the length of
a shortest word of $K$. Therefore, such a $w$ cannot be longer than the size of the
automaton for $K$. An NP algorithm can guess such a word $w$ of length at most the
size of the automaton and compute the minimal DFA for
$\mathrm{closure}^{\preceq}(w)$. This minimal DFA corresponds to a "greedy"
procedure for embedding $w$ in a given string. That is, the states of this DFA
correspond to the maximal prefix of $w$ that can be embedded in the currently read
string. It can be computed in polynomial time from $w$. Verifying whether
$K \subseteq \mathrm{closure}^{\preceq}(w)$ and
$L \cap \mathrm{closure}^{\preceq}(w) = \emptyset$ then reduces to standard
automata constructions that can be done in polynomial time.
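The "greedy" DFA mentioned in this argument can be sketched in Python as follows.
The function names `closure_dfa` and `accepts`, and the dictionary encoding of the
transition function, are assumptions made for this illustration. State $i$ records
that the longest prefix of $w$ embedded so far has length $i$.

```python
def closure_dfa(w, alphabet):
    """DFA for the words that contain w as a (scattered) subsequence.

    Returns (delta, initial_state, accepting_state), where delta maps
    (state, symbol) to the next state and states are 0..len(w).
    """
    delta = {}
    for i in range(len(w) + 1):
        for a in alphabet:
            # Advance exactly when the next unmatched letter of w is read;
            # the accepting state len(w) loops on every symbol.
            advance = i < len(w) and w[i] == a
            delta[(i, a)] = i + 1 if advance else i
    return delta, 0, len(w)


def accepts(dfa, word):
    """Run the DFA on `word` (all symbols must belong to the alphabet)."""
    delta, state, final = dfa
    for a in word:
        state = delta[(state, a)]
    return state == final
```

The automaton has $|w| + 1$ states and is built in time proportional to
$(|w| + 1) \cdot |\Sigma|$.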

To show NP-hardness, we use a simple reduction from the longest common subsequence
problem, which is well known to be NP-hard [17]. A word $w$ is a longest common
subsequence of words $(w_i)_{i=1}^n$ if $w \preceq w_i$ for all $1 \le i \le n$ and
there is no longer word with this property. This word $w$ is not necessarily unique
(the longest common subsequence for $ab$ and $ba$ could be $a$ or $b$). By [17],
determining whether the length of the longest common subsequences of words
$(w_i)_{i=1}^n$ is at least a given $k$ is NP-hard with respect to
$\sum_{i=1}^n |w_i|$ and $k$.

Consider the DFA A that accepts the finite language $K = \{w_1, \ldots, w_n\}$ and
the DFA B that accepts the language $L$ of all words up to length $k - 1$. Then the
existence of a common subsequence of $(w_i)_{i=1}^n$ of length at least $k$ is
equivalent to the possibility to separate $K$ and $L$ by
$\mathcal{F}(\preceq, \text{single})$. Furthermore, we can construct A in time
$O(\sum_{i=1}^n |w_i|)$ and B in time $O(k \cdot |\Sigma|)$. Since both A and B are
DFAs, we have shown that the problem even remains NP-hard if the input is given as
DFAs instead of NFAs. □



Actually, using the proof of Lemma 24 we can prove the same result for union- 
free languages. 

Lemma 25. The separation problem by union-free languages is NP-complete. 

Proof. The NP-hardness part of the proof of Lemma 24 also applies to union-free
languages since the language
$\mathrm{closure}^{\preceq}(a_1 a_2 \cdots a_n) = \Sigma^* a_1 \Sigma^* a_2 \Sigma^* \cdots \Sigma^* a_n \Sigma^*$
is union free. Indeed, any regular expression
$(b_1 + b_2 + \cdots + b_m)^* = (b_1^* \cdots b_m^*)^*$ is union free. The NP
algorithm guesses a word $w$ as above and the positions and scopes of star
operators. □

We now turn to separation by $\mathcal{F}(\preceq, \text{unions})$.

Lemma 26. A language $K$ is separable from a language $L$ by
$\mathcal{F}(\preceq, \text{unions})$ if and only if there exist no words
$w \in K$ and $w' \in L$ such that $w \preceq w'$.

Proof. If there exist $w \in K$ and $w' \in L$ with $w \preceq w'$, then any
$\preceq$-closed language containing $w$ also contains $w'$. Since unions of
$\preceq$-closed languages are also $\preceq$-closed, we have that $K$ is not
separable from $L$ by $\mathcal{F}(\preceq, \text{unions})$.

The opposite implication follows directly from Claim 17. Observe that in this
case layer($K$, $L$) = $K$ and every $\preceq$-closed language is a finite union of
languages $\Sigma^* a_1 \Sigma^* \cdots \Sigma^* a_n \Sigma^*$ due to Higman's
lemma. □
Words $w \in K$ and $w' \in L$ with $w \preceq w'$ exist iff
$\mathrm{closure}^{\preceq}(K) \cap L \neq \emptyset$. An NFA for
$\mathrm{closure}^{\preceq}(K)$ is obtained by adding self loops under all symbols
in $\Sigma$ to all states of the automaton for $K$. Emptiness of intersection is
then decidable in polynomial time by standard methods. This gives the following
lemma.
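The test just described can be sketched in Python as follows. The encoding of an
NFA as a tuple (states, initial states, transitions, accepting states), with
transitions given as (source, symbol, target) triples, and all function names are
assumptions made for this illustration. The product search is written for
readability rather than speed.

```python
def upward_closure(nfa, alphabet):
    """Add a self loop on every symbol to every state of the NFA."""
    states, inits, trans, finals = nfa
    loops = {(s, a, s) for s in states for a in alphabet}
    return states, inits, set(trans) | loops, finals


def intersection_nonempty(nfa1, nfa2):
    """Search the product automaton for a reachable accepting pair."""
    _, inits1, trans1, finals1 = nfa1
    _, inits2, trans2, finals2 = nfa2
    stack = [(p, q) for p in inits1 for q in inits2]
    seen = set(stack)
    while stack:
        p, q = stack.pop()
        if p in finals1 and q in finals2:
            return True
        for p1, a, p2 in trans1:
            if p1 != p:
                continue
            for q1, b, q2 in trans2:
                # Both automata must read the same symbol.
                if q1 == q and a == b and (p2, q2) not in seen:
                    seen.add((p2, q2))
                    stack.append((p2, q2))
    return False


def separable_by_unions(nfa_k, nfa_l, alphabet):
    """K is separable from L iff closure(K) and L are disjoint (Lemma 26)."""
    return not intersection_nonempty(upward_closure(nfa_k, alphabet), nfa_l)
```

For example, with $K = \{ab\}$ the sketch reports that $L = \{ba\}$ can be
separated, whereas $L = \{acb\}$ cannot, since $ab \preceq acb$.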

Lemma 27. The separation problem by $\mathcal{F}(\preceq, \text{unions})$ is in polynomial time.



The Suffix Order Cases

It remains to prove the cases for the suffix order $\preceq_s$. Let lcs($L$) denote
the longest common suffix of all words of language $L$.

Lemma 28. A language $K$ is separable from a language $L$ by
$\mathcal{F}(\preceq_s, \text{single})$ if and only if there is no word $w' \in L$
such that lcs($K$) $\preceq_s w'$.

Proof. The separation problem asks to check the existence of a word
$w \in \Sigma^*$ such that $K \subseteq \Sigma^* w$ and
$\Sigma^* w \cap L = \emptyset$. Obviously, if such a word exists, it must be a
common suffix of all words from $K$. Assume that there is no $w' \in L$ such that
lcs($K$) $\preceq_s w'$. Then $K$ is separable from $L$ by the language
$\Sigma^*$lcs($K$). To show the opposite implication, assume that there exists a
$w' \in L$ such that lcs($K$) $\preceq_s w'$. Then, for any common suffix $w$ of
$K$, it holds that $w \preceq_s w'$, which means that $K$ is not separable from $L$
by a language from $\mathcal{F}(\preceq_s, \text{single})$. □
The word lcs($K$) can be computed from the automaton for $K$ in polynomial time
by inspecting paths that end up in accepting states. The length of lcs($K$) is not
larger than the length of the shortest word in $K$, hence linear with respect to
the size of the automaton. Checking whether there exists $w' \in L$ such that
lcs($K$) $\preceq_s w'$ can be done in polynomial time by testing non-emptiness of
the language $\Sigma^*$lcs($K$) $\cap L$.
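To make the criterion of Lemma 28 concrete, here is a small Python sketch for the
special case where $K$ and $L$ are finite languages given as explicit collections of
words. This simplification is an assumption of the example; the paper works with
NFAs. The function names are hypothetical.

```python
import os


def longest_common_suffix(words):
    """Longest common suffix of a nonempty finite collection of words."""
    # os.path.commonprefix compares strings character by character, so
    # applying it to the reversed words yields the reversed common suffix.
    reversed_words = [w[::-1] for w in words]
    return os.path.commonprefix(reversed_words)[::-1]


def separable_by_single_suffix(k_words, l_words):
    """K is separable from L iff no word of L has lcs(K) as a suffix."""
    lcs = longest_common_suffix(k_words)
    return not any(w.endswith(lcs) for w in l_words)
```

If the words of $K$ share no nonempty suffix, then lcs($K$) is the empty word,
which is a suffix of every word, so $K$ is separable only from the empty language.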

Lemma 29. The separation problem by $\mathcal{F}(\preceq_s, \text{single})$ is in polynomial time.

Lemma 30. A language $K$ is separable from a language $L$ by
$\mathcal{F}(\preceq_s, \text{unions})$ if and only if the following two conditions
are satisfied:

1. there exist no words $w \in K$ and $w' \in L$ such that $w \preceq_s w'$,

2. there exists a natural number $k > 0$ such that no words $w \in K$ and
$w' \in L$ have a common suffix of length $k$.

Proof. From left to right. Assume that $K$ is separable from $L$ by a language
$S = \bigcup_{i=1}^n \Sigma^* w_i$. If $w \in K$, and hence $w \in S$, then there
is no $w' \in L$ such that $w \preceq_s w'$, since $S$ contains all words that are
larger than $w$ in $\preceq_s$ and $S \cap L = \emptyset$. Assume that for every
number $k$ there are words $w \in K$ and $w' \in L$ with a common suffix of length
$k$. Then, in particular, there are words $w \in K$ and $w' \in L$ with a common
suffix of length $\max(|w_1|, \ldots, |w_n|) + 1$. However, these words are either
both inside $S$ or both outside $S$, which contradicts that $K$ is separable from
$L$ by $S$. This concludes the proof from left to right.

For the other direction, assume that $K$ and $L$ satisfy conditions 1 and 2. Let
$M = \{ w \in \Sigma^* \mid |w| \le k$ and there is no $w' \in L$ such that
$w \preceq_s w' \}$ and define $S = \bigcup_{w \in M} \Sigma^* w$. By definition,
$S \cap L = \emptyset$ and $S$ is a finite union of suffix languages, i.e., a
finite union of $\preceq_s$-closures of words. We show that $K \subseteq S$.
Indeed, let $w \in K$. If $|w| > k$ and $v$ is a suffix of $w$ of length $k$, then
$v$ belongs to $M$, which implies that $w \in \Sigma^* v \subseteq S$. If
$|w| \le k$, then $w \in M$ since there is no $w' \in L$ such that
$w \preceq_s w'$. Thus, $w \in S$, which completes the proof. □

We now argue that the two conditions in Lemma 30 can be tested in polynomial
time, given NFAs for $K$ and $L$. To check the first condition we test in
polynomial time whether $(\Sigma^* K) \cap L$ is nonempty; the condition holds iff
this intersection is empty. To decide the second condition we compute the reversals
rev($K$) and rev($L$) of languages $K$ and $L$, respectively. This is done by
reversing transitions in the corresponding automata and swapping the role of
initial and accepting states. We note that this step may require an NFA to have
more than one initial state, but NFAs are known to be sufficiently robust to allow
this. Common suffixes of words from $K$ and $L$ are common prefixes of words from
rev($K$) and rev($L$). We then compute the language of all prefixes of words from
rev($K$) and rev($L$) by making all the states accepting, thereby obtaining
languages pref(rev($K$)) and pref(rev($L$)), respectively. The intersection
$I$ = pref(rev($K$)) $\cap$ pref(rev($L$)) is the set of all words
$v \in \Sigma^*$ such that there are words $w \in K$ and $w' \in L$ with
rev($v$) $\preceq_s w$ and rev($v$) $\preceq_s w'$. To check the condition it is
sufficient to test whether the language $I$ is infinite, which can also be done in
polynomial time. This leads to the following lemma.
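The second condition can be illustrated by the following Python sketch. It does not
build the reversed automata literally; instead it uses an equivalent reformulation
on the original automata, which is an assumption of this example: words of $K$ and
$L$ share arbitrarily long suffixes iff the synchronous product, restricted to
states reachable in their own automaton, contains a cycle from which a pair of
accepting states is reachable. NFAs are encoded as (initial states, transitions,
accepting states) with transitions as (source, symbol, target) triples; all names
are hypothetical.

```python
def _reach(adj, starts):
    """All nodes reachable from `starts` in the adjacency map `adj`."""
    seen, stack = set(starts), list(starts)
    while stack:
        s = stack.pop()
        for t in adj.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def _adjacency(trans):
    adj = {}
    for p, _, q in trans:
        adj.setdefault(p, set()).add(q)
    return adj


def unbounded_common_suffixes(nfa_k, nfa_l):
    """True iff words of K and L have common suffixes of every length."""
    inits_k, trans_k, finals_k = nfa_k
    inits_l, trans_l, finals_l = nfa_l
    reach_k = _reach(_adjacency(trans_k), inits_k)
    reach_l = _reach(_adjacency(trans_l), inits_l)

    # Synchronous product on pairs of individually reachable states.
    fwd, bwd = {}, {}
    for p, a, p2 in trans_k:
        for q, b, q2 in trans_l:
            if a == b and p in reach_k and q in reach_l:
                fwd.setdefault((p, q), set()).add((p2, q2))
                bwd.setdefault((p2, q2), set()).add((p, q))

    accepting = {(f, g) for f in set(finals_k) & reach_k
                 for g in set(finals_l) & reach_l}
    alive = _reach(bwd, accepting)  # pairs that can reach an accepting pair

    # Repeatedly delete pairs without a surviving successor; a pair
    # survives only if an infinite path, hence a cycle, starts from it.
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if not (fwd.get(node, set()) & alive):
                alive.discard(node)
                changed = True
    return bool(alive)
```

Under this reformulation, the second condition of Lemma 30 holds exactly when
`unbounded_common_suffixes` returns `False`.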

Lemma 31. The separation problem by $\mathcal{F}(\preceq_s, \text{unions})$ is in polynomial time.
Lemma 32. A language $K$ is separable from a language $L$ by
$\mathcal{F}(\preceq_s, \text{bc})$ if and only if there exists a natural number
$k > 0$ such that no words $w \in K$ and $w' \in L$ have a common suffix of
length $k$.

Proof. Assume that $K$ is separable from $L$ by a finite boolean combination of
languages $\Sigma^* w_1, \ldots, \Sigma^* w_n$. Let
$k = \max(|w_1|, \ldots, |w_n|) + 1$. Note that, for all words $wv$ and $w'v$ with
$|v| \ge k$ and all $1 \le i \le n$, it holds that $wv \in \Sigma^* w_i$ if and
only if $w'v \in \Sigma^* w_i$. Thus, any words with a common suffix of length at
least $k$ cannot be separated by the considered set of languages, which means that
there are no words $w \in K$ and $w' \in L$ with a common suffix of length $k$.

To show the opposite implication, assume that there exists a natural number $k$
satisfying the condition. Then, for every $w \in K$, if $|w| < k$, we can cover the
word $w$ by the language
$\{w\} = \Sigma^* w \setminus \bigcup_{a \in \Sigma} \Sigma^* a w$. If
$|w| \ge k$, then $w \in \Sigma^* v$, where $v$ is a suffix of $w$ of length $k$.
By the assumption that no words of $K$ and $L$ have a common suffix of length $k$
we have that $\Sigma^* v \cap L = \emptyset$, which completes the proof. □

It therefore follows that separability by $\mathcal{F}(\preceq_s, \text{bc})$ can
be decided with a simplified version of the procedure for
$\mathcal{F}(\preceq_s, \text{unions})$. Since the latter was already in polynomial
time according to Lemma 31, we have the following lemma.

Lemma 33. The separation problem by $\mathcal{F}(\preceq_s, \text{bc})$ is in polynomial time.